Project: Identify Customer Segments

In this project, you will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

This notebook will help you complete this task by providing a framework within which you will perform your analysis steps. In each step of the project, you will see some text describing the subtask that you will perform, followed by one or more code cells for you to complete your work. Feel free to add additional code and markdown cells as you go along so that you can explore everything in precise chunks. The code cells provided in the base template will outline only the major tasks, and will usually not be enough to cover all of the minor tasks that comprise it.

It should be noted that while there will be precise guidelines on how you should handle certain tasks in the project, there will also be places where an exact specification is not provided. There will be times in the project where you will need to make and justify your own decisions on how to treat the data. These are places where there may not be only one way to handle the data. In real-life tasks, there may be many valid ways to approach an analysis task. One of the most important things you can do is clearly document your approach so that other scientists can understand the decisions you've made.

At the end of most sections, there will be a Markdown cell labeled Discussion. In these cells, you will report your findings for the completed section, as well as document the decisions that you made in your approach to each subtask. Your project will be evaluated not just on the code used to complete the tasks outlined, but also your communication about your observations and conclusions at each stage.

In [1]:
# Make sure any changes to custom packages are reflected immediately
# in the notebook without a kernel restart
%load_ext autoreload
%autoreload 2

# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly_express as px
import pandas_profiling as pdp

# magic word for producing visualizations in notebook
%matplotlib inline

# Setup some nice defaults for dataframe display
pd.set_option("display.max_columns", 85)
pd.set_option("display.max_rows", 85)

# Given the expected size of our DataFrames, disable jedi to speed up autocompletion
%config Completer.use_jedi = False

Step 0: Load the Data

There are four files associated with this project (not including this one):

  • Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
  • Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
  • Data_Dictionary.md: Detailed information file about the features in the provided datasets.
  • AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. You will use this information to cluster the general population into groups with similar demographic properties. Then, you will see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.
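The comparison step described above can be sketched ahead of time. This is a hypothetical helper (names and toy labels are my own, not part of the provided data): given cluster assignments for the general population and for customers, a ratio of customer share to population share above 1 flags clusters that are candidates for the core customer base.

```python
import pandas as pd

def cluster_representation(pop_labels, cust_labels):
    """Return customer-share / population-share for each cluster label,
    sorted so the most over-represented clusters come first."""
    pop_share = pd.Series(pop_labels).value_counts(normalize=True)
    cust_share = pd.Series(cust_labels).value_counts(normalize=True)
    return (cust_share / pop_share).sort_values(ascending=False)

# Toy labels: cluster 0 holds 20% of the population but 50% of customers,
# so it should surface as over-represented (ratio 2.5)
pop = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
cust = [0, 0, 0, 0, 1, 2, 2, 2, 1, 0]
ratios = cluster_representation(pop, cust)
```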

To start off with, load in the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note for all of the .csv data files in this project: they're semicolon (;) delimited, so you'll need an additional argument in your read_csv() call to read in the data properly. Also, considering the size of the main dataset, it may take some time for it to load completely.

Once the dataset is loaded, it's recommended that you take a little bit of time just browsing the general structure of the dataset and feature summary file. You'll be getting deep into the innards of the cleaning in the first major step of the project, so gaining some general familiarity can help you get your bearings.

In [2]:
# Load in the general demographics data.
gen_pop = pd.read_csv('data/Udacity_AZDIAS_Subset.csv', sep=';')
gen_pop.head()
Out[2]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GEBURTSJAHR GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ TITEL_KZ VERS_TYP ZABEOTYP ALTER_HH ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE KK_KUNDENTYP W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_BAUMAX KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 -1 2 1 2.0 3 4 3 5 5 3 4 0 10.0 0 -1 15.0 4.0 2.0 2.0 1.0 1.0 0 0 5.0 2 6 7 5 1 5 3 3 4 7 6 6 5 3 -1 NaN NaN -1 3 NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -1 1 2 5.0 1 5 2 5 4 5 1 1996 10.0 0 3 21.0 6.0 5.0 3.0 2.0 1.0 1 14 1.0 5 4 4 3 1 2 2 3 6 4 7 4 7 6 3 1.0 0.0 2 5 0.0 2.0 0.0 6.0 NaN 3.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 4.0 8 8A 51 0.0 0.0 0.0 2.0 5.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 -1 3 2 3.0 1 4 1 2 3 5 1 1979 10.0 1 3 3.0 1.0 1.0 1.0 3.0 2.0 1 15 3.0 4 1 3 3 4 4 6 3 4 7 7 7 3 3 2 0.0 0.0 1 5 17.0 1.0 0.0 4.0 NaN 3.0 9.0 10.0 0.0 1.0 5.0 1992.0 W 2.0 4 4C 24 1.0 3.0 1.0 0.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2 4 2 2.0 4 2 5 2 1 2 6 1957 1.0 0 2 0.0 0.0 0.0 0.0 9.0 4.0 1 8 2.0 5 1 2 1 4 4 7 4 3 4 4 5 4 4 1 0.0 0.0 1 3 13.0 0.0 0.0 1.0 NaN NaN 9.0 1.0 0.0 1.0 4.0 1997.0 W 7.0 2 2A 12 4.0 1.0 0.0 0.0 1.0 4.0 4.0 2.0 6.0 4.0 0.0 4.0 1.0 0.0 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 -1 3 1 5.0 4 3 4 1 3 2 5 1963 5.0 0 3 32.0 10.0 10.0 5.0 3.0 2.0 1 8 5.0 6 4 4 2 7 4 4 6 2 3 2 2 4 2 2 0.0 0.0 2 4 20.0 4.0 0.0 5.0 1.0 2.0 9.0 3.0 0.0 1.0 4.0 1992.0 W 3.0 6 6B 43 1.0 4.0 1.0 0.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0
In [3]:
# Load in the feature summary file.
feat_info = pd.read_csv('data/AZDIAS_Feature_Summary.csv', sep=';')
feat_info.head()
Out[3]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
In [4]:
# Check the structure of the data after it's loaded (e.g. print the number of
# rows and columns, print the first few rows).

gen_pop.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Data columns (total 85 columns):
AGER_TYP                 891221 non-null int64
ALTERSKATEGORIE_GROB     891221 non-null int64
ANREDE_KZ                891221 non-null int64
CJT_GESAMTTYP            886367 non-null float64
FINANZ_MINIMALIST        891221 non-null int64
FINANZ_SPARER            891221 non-null int64
FINANZ_VORSORGER         891221 non-null int64
FINANZ_ANLEGER           891221 non-null int64
FINANZ_UNAUFFAELLIGER    891221 non-null int64
FINANZ_HAUSBAUER         891221 non-null int64
FINANZTYP                891221 non-null int64
GEBURTSJAHR              891221 non-null int64
GFK_URLAUBERTYP          886367 non-null float64
GREEN_AVANTGARDE         891221 non-null int64
HEALTH_TYP               891221 non-null int64
LP_LEBENSPHASE_FEIN      886367 non-null float64
LP_LEBENSPHASE_GROB      886367 non-null float64
LP_FAMILIE_FEIN          886367 non-null float64
LP_FAMILIE_GROB          886367 non-null float64
LP_STATUS_FEIN           886367 non-null float64
LP_STATUS_GROB           886367 non-null float64
NATIONALITAET_KZ         891221 non-null int64
PRAEGENDE_JUGENDJAHRE    891221 non-null int64
RETOURTYP_BK_S           886367 non-null float64
SEMIO_SOZ                891221 non-null int64
SEMIO_FAM                891221 non-null int64
SEMIO_REL                891221 non-null int64
SEMIO_MAT                891221 non-null int64
SEMIO_VERT               891221 non-null int64
SEMIO_LUST               891221 non-null int64
SEMIO_ERL                891221 non-null int64
SEMIO_KULT               891221 non-null int64
SEMIO_RAT                891221 non-null int64
SEMIO_KRIT               891221 non-null int64
SEMIO_DOM                891221 non-null int64
SEMIO_KAEM               891221 non-null int64
SEMIO_PFLICHT            891221 non-null int64
SEMIO_TRADV              891221 non-null int64
SHOPPER_TYP              891221 non-null int64
SOHO_KZ                  817722 non-null float64
TITEL_KZ                 817722 non-null float64
VERS_TYP                 891221 non-null int64
ZABEOTYP                 891221 non-null int64
ALTER_HH                 817722 non-null float64
ANZ_PERSONEN             817722 non-null float64
ANZ_TITEL                817722 non-null float64
HH_EINKOMMEN_SCORE       872873 non-null float64
KK_KUNDENTYP             306609 non-null float64
W_KEIT_KIND_HH           783619 non-null float64
WOHNDAUER_2008           817722 non-null float64
ANZ_HAUSHALTE_AKTIV      798073 non-null float64
ANZ_HH_TITEL             794213 non-null float64
GEBAEUDETYP              798073 non-null float64
KONSUMNAEHE              817252 non-null float64
MIN_GEBAEUDEJAHR         798073 non-null float64
OST_WEST_KZ              798073 non-null object
WOHNLAGE                 798073 non-null float64
CAMEO_DEUG_2015          792242 non-null object
CAMEO_DEU_2015           792242 non-null object
CAMEO_INTL_2015          792242 non-null object
KBA05_ANTG1              757897 non-null float64
KBA05_ANTG2              757897 non-null float64
KBA05_ANTG3              757897 non-null float64
KBA05_ANTG4              757897 non-null float64
KBA05_BAUMAX             757897 non-null float64
KBA05_GBZ                757897 non-null float64
BALLRAUM                 797481 non-null float64
EWDICHTE                 797481 non-null float64
INNENSTADT               797481 non-null float64
GEBAEUDETYP_RASTER       798066 non-null float64
KKK                      770025 non-null float64
MOBI_REGIO               757897 non-null float64
ONLINE_AFFINITAET        886367 non-null float64
REGIOTYP                 770025 non-null float64
KBA13_ANZAHL_PKW         785421 non-null float64
PLZ8_ANTG1               774706 non-null float64
PLZ8_ANTG2               774706 non-null float64
PLZ8_ANTG3               774706 non-null float64
PLZ8_ANTG4               774706 non-null float64
PLZ8_BAUMAX              774706 non-null float64
PLZ8_HHZ                 774706 non-null float64
PLZ8_GBZ                 774706 non-null float64
ARBEIT                   794005 non-null float64
ORTSGR_KLS9              794005 non-null float64
RELAT_AB                 794005 non-null float64
dtypes: float64(49), int64(32), object(4)
memory usage: 751.9 MB

That's weird. This says that I have 891,221 rows but the data dictionary and Udacity project page indicate that there should be 10 fewer entries than that, 891,211. A typo in the data dictionary page perhaps? Or maybe the data have been updated?

Or perhaps something more nefarious?

Step 1: Preprocessing

Step 1.1: Assess Missing Data

The feature summary file contains a summary of properties for each demographics data column. You will use this file to help you make cleaning decisions during this stage of the project. First of all, you should assess the demographics data in terms of missing data. Pay attention to the following points as you perform your analysis, and take notes on what you observe. Make sure that you fill in the Discussion cell with your findings and decisions at the end of each step that has one!

Step 1.1.1: Convert Missing Value Codes to NaNs

The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), this will get read in as a string object. You'll need to do a little bit of parsing to make use of it to identify and clean the data. Convert data that matches a 'missing' or 'unknown' value code into a numpy NaN value. You might want to see how much data takes on a 'missing' or 'unknown' code, and how much data is naturally missing, as a point of interest.

As one more reminder, you are encouraged to add additional cells to break up your analysis into manageable chunks.

In [5]:
# Look at naturally missing data first

missing = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(gen_pop),2)
missing.sort_values('total missing', ascending = False, inplace=True)
missing
Out[5]:
total missing percent missing
KK_KUNDENTYP 584612 0.66
KBA05_GBZ 133324 0.15
KBA05_ANTG1 133324 0.15
KBA05_ANTG3 133324 0.15
MOBI_REGIO 133324 0.15
KBA05_ANTG2 133324 0.15
KBA05_ANTG4 133324 0.15
KBA05_BAUMAX 133324 0.15
REGIOTYP 121196 0.14
KKK 121196 0.14
PLZ8_HHZ 116515 0.13
PLZ8_BAUMAX 116515 0.13
PLZ8_ANTG4 116515 0.13
PLZ8_GBZ 116515 0.13
PLZ8_ANTG2 116515 0.13
PLZ8_ANTG1 116515 0.13
PLZ8_ANTG3 116515 0.13
W_KEIT_KIND_HH 107602 0.12
KBA13_ANZAHL_PKW 105800 0.12
CAMEO_INTL_2015 98979 0.11
CAMEO_DEU_2015 98979 0.11
CAMEO_DEUG_2015 98979 0.11
ARBEIT 97216 0.11
ORTSGR_KLS9 97216 0.11
RELAT_AB 97216 0.11
ANZ_HH_TITEL 97008 0.11
BALLRAUM 93740 0.11
INNENSTADT 93740 0.11
EWDICHTE 93740 0.11
GEBAEUDETYP_RASTER 93155 0.10
WOHNLAGE 93148 0.10
ANZ_HAUSHALTE_AKTIV 93148 0.10
MIN_GEBAEUDEJAHR 93148 0.10
OST_WEST_KZ 93148 0.10
GEBAEUDETYP 93148 0.10
KONSUMNAEHE 73969 0.08
ANZ_PERSONEN 73499 0.08
WOHNDAUER_2008 73499 0.08
ANZ_TITEL 73499 0.08
ALTER_HH 73499 0.08
TITEL_KZ 73499 0.08
SOHO_KZ 73499 0.08
HH_EINKOMMEN_SCORE 18348 0.02
LP_STATUS_GROB 4854 0.01
ONLINE_AFFINITAET 4854 0.01
LP_FAMILIE_FEIN 4854 0.01
LP_STATUS_FEIN 4854 0.01
LP_LEBENSPHASE_FEIN 4854 0.01
GFK_URLAUBERTYP 4854 0.01
LP_LEBENSPHASE_GROB 4854 0.01
RETOURTYP_BK_S 4854 0.01
CJT_GESAMTTYP 4854 0.01
LP_FAMILIE_GROB 4854 0.01
AGER_TYP 0 0.00
ALTERSKATEGORIE_GROB 0 0.00
SEMIO_SOZ 0 0.00
ANREDE_KZ 0 0.00
FINANZ_MINIMALIST 0 0.00
FINANZ_SPARER 0 0.00
FINANZ_VORSORGER 0 0.00
FINANZ_ANLEGER 0 0.00
FINANZ_UNAUFFAELLIGER 0 0.00
FINANZ_HAUSBAUER 0 0.00
FINANZTYP 0 0.00
GEBURTSJAHR 0 0.00
GREEN_AVANTGARDE 0 0.00
HEALTH_TYP 0 0.00
NATIONALITAET_KZ 0 0.00
PRAEGENDE_JUGENDJAHRE 0 0.00
SEMIO_FAM 0 0.00
VERS_TYP 0 0.00
SEMIO_REL 0 0.00
SEMIO_MAT 0 0.00
SEMIO_VERT 0 0.00
SEMIO_LUST 0 0.00
SEMIO_ERL 0 0.00
SEMIO_KULT 0 0.00
SEMIO_RAT 0 0.00
SEMIO_KRIT 0 0.00
SEMIO_DOM 0 0.00
SEMIO_KAEM 0 0.00
SEMIO_PFLICHT 0 0.00
SEMIO_TRADV 0 0.00
SHOPPER_TYP 0 0.00
ZABEOTYP 0 0.00
In [6]:
cols_with_missing = len(missing[missing['total missing'] > 0])
print(
    f"{cols_with_missing} columns have a nonzero number of naturally missing values")
53 columns have a nonzero number of naturally missing values

Oof, looks like the consumer behavior type (KK_KUNDENTYP) is pretty sorely lacking values. Hard to say if it will be the dominant "missing value" feature once we fill in the coded missing values with np.nan, but given that it's more than 60% missing I'll probably need to drop it as a feature. In my experience, features with that many missing values are rarely useful for modeling.
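If KK_KUNDENTYP does end up getting dropped, a small threshold-based helper keeps the decision explicit and reusable. A minimal sketch (the function name, the 60% cutoff, and the toy frame are illustrative, not part of the project data):

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_frac=0.6):
    """Drop any column whose fraction of NaN values exceeds the threshold."""
    frac_missing = df.isnull().mean()  # per-column fraction of NaNs
    keep = frac_missing[frac_missing <= max_missing_frac].index
    return df[keep]

demo = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    'mostly_present': [1.0, 2.0, np.nan, 4.0],        # 25% missing
})
kept = drop_sparse_columns(demo)
# 'mostly_missing' is removed; 'mostly_present' survives
```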

In [7]:
# Convert the strings for the missing values from the feature summary
# to proper lists of values to use for filling in NaNs

# First, remove brackets
# Then, split on comma separator
feat_info.loc[:, 'missing_or_unknown'] = \
    feat_info.loc[:, 'missing_or_unknown'].str[1:-1].str.split(',')

feat_info.head()
Out[7]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1, 0]
1 ALTERSKATEGORIE_GROB person ordinal [-1, 0, 9]
2 ANREDE_KZ person categorical [-1, 0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
In [8]:
# What columns are object (probably mixed) type?
gen_pop.columns[gen_pop.dtypes == 'object']
Out[8]:
Index(['OST_WEST_KZ', 'CAMEO_DEUG_2015', 'CAMEO_DEU_2015', 'CAMEO_INTL_2015'], dtype='object')
In [9]:
# What unique data types exist within the object-type columns?
col_types = {}

for col in gen_pop.columns[gen_pop.dtypes == 'object']:
    col_types[col] = {type(e) for e in gen_pop[col]}
    
col_types
Out[9]:
{'OST_WEST_KZ': {float, str},
 'CAMEO_DEUG_2015': {float, str},
 'CAMEO_DEU_2015': {float, str},
 'CAMEO_INTL_2015': {float, str}}

Good to know! Looks like all object-type columns are of mixed float and str type. This will help us do intelligent typecasting as we parse through the feat_info missing codes below.

A few other notes (ignoring codes used for missing):

  1. OST_WEST_KZ: can be "O" or "W"
  2. CAMEO_DEUG_2015: all int
  3. CAMEO_DEU_2015: alphanumeric (of form INT-LETTER)
  4. CAMEO_INTL_2015: all int
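These observations come from inspecting the unique non-null values of each object column. Sketched here on a toy frame that mimics the mixed float/str structure (NaN is a float, which is why string columns with missing values show up as {float, str}):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking two of the object-type columns
toy = pd.DataFrame({
    'OST_WEST_KZ':    ['O', 'W', np.nan, 'W'],
    'CAMEO_DEU_2015': ['8A', np.nan, '4C', '2A'],
})

# Unique non-null values per object column
uniques = {col: sorted(toy[col].dropna().unique()) for col in toy.columns}
```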
In [10]:
def fill_missing(df, missing_codes_mapping, inplace=False):
    '''
    Parses dataframe of missing values and their mapping to individual feature names
    and then fills any of those values found in a dataframe's matching feature columns
    with np.nan.

    Inputs
    ------
    df: pandas DataFrame. Table with features that match the ones for which we have
        missing mappings. Each sample is a person.

    missing_codes_mapping: pandas DataFrame. Contains columns 'attribute' and 
        'missing_or_unknown' that map codes used for missing/unknown values to 
        features/attributes. 'missing_or_unknown' is expected to have elements
        that are lists of str (usually ints, but sometimes chars or empty lists).

    Returns
    -------
    df with NaN values filled in according to missing_codes_mapping
    '''

    # Use deep copy if inplace = False, otherwise use shallow copy
    data = df.copy(deep=not inplace)
    missing_codes = missing_codes_mapping.copy(deep=not inplace)

    def parse_missing_codes(code_list):
        '''
        Goes through a list of str and converts the elements of the list according to the needs 
        of the dtypes in our demographic data such that the results can be used for 
        filling in NaN values.

        Inputs
        ------
        code_list: list of str. List is expected to contain the chars, floats, or ints 
            that are codes indicating a missing or unknown value.

        Returns
        -------
        list or np.nan. Each element of the list returned is typecast according to 
            the expected needs of the NaN-filling it will be doing. Empty lists
            (or lists with only an empty string in them) are returned as np.nan.
        '''

        # Make sure list isn't just empty string
        if '' not in code_list:
            # Check if list can be converted to int without issues - if so, do it
            try:
                return [int(e) for e in code_list]

            # Not all are cast-able to int
            except ValueError:
                return [float(e) if 'X' not in e else e for e in code_list]

        else:
            return np.nan

    # Typecast missing value codes appropriately
    missing_codes.loc[:, 'missing_or_unknown'] = \
        missing_codes.loc[:, 'missing_or_unknown'].apply(parse_missing_codes)

    # Create series that maps feature names (index) to missing codes (data)
    code_map = pd.Series(data=missing_codes['missing_or_unknown'].values,
                         index=missing_codes['attribute'].values)

    # When passing a Series into to_replace, index is key and data is value (like a dict)
    data.replace(to_replace=code_map,
                 value=np.nan,
                 inplace=True)

    return data
In [11]:
# Replace codes we know to mean "missing" or "unknown" with NaN
gen_pop = fill_missing(gen_pop, feat_info)
gen_pop.head()
Out[11]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GEBURTSJAHR GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ TITEL_KZ VERS_TYP ZABEOTYP ALTER_HH ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE KK_KUNDENTYP W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_BAUMAX KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 NaN 2.0 1 2.0 3 4 3 5 5 3 4 NaN 10.0 0 NaN 15.0 4.0 2.0 2.0 1.0 1.0 NaN NaN 5.0 2 6 7 5 1 5 3 3 4 7 6 6 5 3 NaN NaN NaN NaN 3 NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN 1.0 2 5.0 1 5 2 5 4 5 1 1996.0 10.0 0 3.0 21.0 6.0 5.0 3.0 2.0 1.0 1.0 14.0 1.0 5 4 4 3 1 2 2 3 6 4 7 4 7 6 3.0 1.0 NaN 2.0 5 NaN 2.0 0.0 6.0 NaN 3.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 4.0 8 8A 51 0.0 0.0 0.0 2.0 5.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 NaN 3.0 2 3.0 1 4 1 2 3 5 1 1979.0 10.0 1 3.0 3.0 1.0 1.0 1.0 3.0 2.0 1.0 15.0 3.0 4 1 3 3 4 4 6 3 4 7 7 7 3 3 2.0 0.0 NaN 1.0 5 17.0 1.0 0.0 4.0 NaN 3.0 9.0 10.0 0.0 1.0 5.0 1992.0 W 2.0 4 4C 24 1.0 3.0 1.0 0.0 NaN 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2.0 4.0 2 2.0 4 2 5 2 1 2 6 1957.0 1.0 0 2.0 NaN NaN NaN NaN 9.0 4.0 1.0 8.0 2.0 5 1 2 1 4 4 7 4 3 4 4 5 4 4 1.0 0.0 NaN 1.0 3 13.0 0.0 0.0 1.0 NaN NaN 9.0 1.0 0.0 1.0 4.0 1997.0 W 7.0 2 2A 12 4.0 1.0 0.0 0.0 1.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 NaN 3.0 1 5.0 4 3 4 1 3 2 5 1963.0 5.0 0 3.0 32.0 10.0 10.0 5.0 3.0 2.0 1.0 8.0 5.0 6 4 4 2 7 4 4 6 2 3 2 2 4 2 2.0 0.0 NaN 2.0 4 20.0 4.0 0.0 5.0 1.0 2.0 9.0 3.0 0.0 1.0 4.0 1992.0 W 3.0 6 6B 43 1.0 4.0 1.0 0.0 NaN 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0

Step 1.1.2: Assess Missing Data in Each Column

How much missing data is present in each column? There are a few columns that are outliers in terms of the proportion of values that are missing. You will want to use matplotlib's hist() function to visualize the distribution of missing value counts to find these columns. Identify and document these columns. While some of these columns might have justifications for keeping or re-encoding the data, for this project you should just remove them from the dataframe. (Feel free to make remarks about these outlier columns in the discussion, however!)

For the remaining features, are there any patterns in which columns have, or share, missing data?
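One quick way to surface shared-missingness patterns is to group column names by their missing-value count: columns with identical counts (like the KBA05_* block) usually come from the same information level and are missing together. A sketch on a toy frame (names illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, np.nan, np.nan],  # missing in the same rows as 'a'
    'c': [np.nan, 1.0, 2.0],
})

# Group column names by how many values they are missing
missing_counts = toy.isnull().sum()
shared = missing_counts.groupby(missing_counts).apply(lambda s: list(s.index))
```

A count shared by several columns is a hint to check whether they are missing in the same rows, not just in the same quantity.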

In [14]:
# What do the missing data counts look like now, by column?

missing_filled = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing_filled['percent missing'] = round(missing_filled['total missing'] / len(gen_pop),2)
missing_filled.sort_values('total missing', ascending = False, inplace=True)
missing_filled
Out[14]:
total missing percent missing
TITEL_KZ 889061 1.00
AGER_TYP 685843 0.77
KK_KUNDENTYP 584612 0.66
KBA05_BAUMAX 476524 0.53
GEBURTSJAHR 392318 0.44
ALTER_HH 310267 0.35
KKK 158064 0.18
REGIOTYP 158064 0.18
W_KEIT_KIND_HH 147988 0.17
KBA05_ANTG1 133324 0.15
KBA05_ANTG2 133324 0.15
KBA05_ANTG3 133324 0.15
KBA05_ANTG4 133324 0.15
KBA05_GBZ 133324 0.15
MOBI_REGIO 133324 0.15
PLZ8_ANTG3 116515 0.13
PLZ8_ANTG2 116515 0.13
PLZ8_GBZ 116515 0.13
PLZ8_HHZ 116515 0.13
PLZ8_ANTG1 116515 0.13
PLZ8_BAUMAX 116515 0.13
PLZ8_ANTG4 116515 0.13
VERS_TYP 111196 0.12
HEALTH_TYP 111196 0.12
SHOPPER_TYP 111196 0.12
NATIONALITAET_KZ 108315 0.12
PRAEGENDE_JUGENDJAHRE 108164 0.12
KBA13_ANZAHL_PKW 105800 0.12
ANZ_HAUSHALTE_AKTIV 99611 0.11
CAMEO_INTL_2015 99352 0.11
CAMEO_DEU_2015 99352 0.11
CAMEO_DEUG_2015 99352 0.11
LP_LEBENSPHASE_FEIN 97632 0.11
ARBEIT 97375 0.11
RELAT_AB 97375 0.11
ORTSGR_KLS9 97274 0.11
ANZ_HH_TITEL 97008 0.11
LP_LEBENSPHASE_GROB 94572 0.11
INNENSTADT 93740 0.11
EWDICHTE 93740 0.11
BALLRAUM 93740 0.11
GEBAEUDETYP_RASTER 93155 0.10
WOHNLAGE 93148 0.10
OST_WEST_KZ 93148 0.10
MIN_GEBAEUDEJAHR 93148 0.10
GEBAEUDETYP 93148 0.10
LP_FAMILIE_FEIN 77792 0.09
LP_FAMILIE_GROB 77792 0.09
KONSUMNAEHE 73969 0.08
WOHNDAUER_2008 73499 0.08
SOHO_KZ 73499 0.08
ANZ_TITEL 73499 0.08
ANZ_PERSONEN 73499 0.08
HH_EINKOMMEN_SCORE 18348 0.02
GFK_URLAUBERTYP 4854 0.01
CJT_GESAMTTYP 4854 0.01
LP_STATUS_FEIN 4854 0.01
LP_STATUS_GROB 4854 0.01
RETOURTYP_BK_S 4854 0.01
ONLINE_AFFINITAET 4854 0.01
ALTERSKATEGORIE_GROB 2881 0.00
FINANZ_UNAUFFAELLIGER 0 0.00
FINANZTYP 0 0.00
FINANZ_HAUSBAUER 0 0.00
GREEN_AVANTGARDE 0 0.00
FINANZ_SPARER 0 0.00
FINANZ_MINIMALIST 0 0.00
FINANZ_VORSORGER 0 0.00
FINANZ_ANLEGER 0 0.00
ANREDE_KZ 0 0.00
SEMIO_KAEM 0 0.00
SEMIO_SOZ 0 0.00
SEMIO_PFLICHT 0 0.00
SEMIO_FAM 0 0.00
SEMIO_REL 0 0.00
SEMIO_MAT 0 0.00
SEMIO_VERT 0 0.00
SEMIO_LUST 0 0.00
SEMIO_ERL 0 0.00
SEMIO_KULT 0 0.00
SEMIO_RAT 0 0.00
SEMIO_KRIT 0 0.00
SEMIO_DOM 0 0.00
SEMIO_TRADV 0 0.00
ZABEOTYP 0 0.00

Well that's a non-trivial change! The first profile I generated prior to filling in missing values had 6.4% of all values missing, now we're at 10.6%! Additionally:

  1. Before NaN-filling, we only had one feature with more than 15% of values missing (KK_KUNDENTYP) and 32 features with no missing values at all.
    1. After filling in the NaN codes with np.nan, we end up with 9 features having more than 15% of values missing and only 24 with none missing.
  2. AGER_TYP has 77% of its values missing. It wasn't even flagged as special prior to the NaN-filling step.
  3. ALTER_HH went from 8.2% missing to 34.8% missing!
  4. Other examples exist (6 more I haven't mentioned that are above 15% missing).
In [15]:
cols_with_missing = len(missing_filled[missing_filled['total missing'] > 0])
print(
    f"{cols_with_missing} columns have a nonzero number of missing values now")
61 columns have a nonzero number of missing values now
In [16]:
# Visualize the missing count across columns

fig, (ax1, ax2) = plt.subplots(nrows=2)

# Total counts of missing (histplot replaces the deprecated sns.distplot)
sns.histplot(gen_pop.isnull().sum(axis=0), bins=50, ax=ax1)
ax1.set(title='Distribution of Missing-Value Counts per Column')

# Percentage missing
sns.histplot(gen_pop.isnull().sum(axis=0) / len(gen_pop), bins=50, ax=ax2)
ax2.set(title='Distribution of Missing-Value Fractions per Column')

plt.tight_layout()

It seems pretty clear here that any features with more than 200k missing values (roughly 22% of the 891,221 rows) are "outliers" in the sense that they sit far from the majority of features, which fall below this threshold. That said, perhaps we should look at each such feature first and decide whether it might still carry information useful for the analysis, even though it is substantially lacking values?

In [17]:
missing_filled[missing_filled['total missing'] > 2E5]
Out[17]:
total missing percent missing
TITEL_KZ 889061 1.00
AGER_TYP 685843 0.77
KK_KUNDENTYP 584612 0.66
KBA05_BAUMAX 476524 0.53
GEBURTSJAHR 392318 0.44
ALTER_HH 310267 0.35

Here are the meanings of these six features:

  1. TITEL_KZ: Academic Title - categorical
    • Since this one is missing all or nearly all of its values, I see no reason this should be retained
  2. AGER_TYP: Elderly Type - categorical
    • To be honest, I don't quite know what this one even is. And since it would only apply to a small portion of the dataset (those considered "elderly") it seems like it isn't crucial in terms of providing information, especially since we have other features that provide us with age-related information anyhow.
    • Let's drop it.
  3. KK_KUNDENTYP: Household-level customer type - categorical
    • As this one has a large amount of missing values (66%) and seems somewhat duplicative with RETOURTYP_BK_S, I think we can drop it
  4. KBA05_BAUMAX: RR3: Most Common Bldg Type - mixed
    • We can probably drop this one too, as it's more than 50% missing and is somewhat duplicative in information with the other KBA05 features
  5. GEBURTSJAHR: Birth Year - numeric
    • This is where it gets a bit tougher. I tend not to be too worried about dropping features with more than 50% missing values, but this one is only at 44%. As this is the specific birth year (at least as implied by the data dictionary), it is likely a better measure of the age of each individual in the dataset than even the ALTERSKATEGORIE_GROB feature, so we'll retain it
    • I freely admit that I originally intended to drop this one, since it sits well outside the majority of features in the missing-value distribution shown above. The instructions later in this project changed my mind (they indicate I shouldn't have dropped any numeric or interval features). On further research I also found guidance to retain a feature and impute its missing values when less than 50-60% is missing, so I was willing to be convinced that this feature should be retained.
  6. ALTER_HH: Head of Household birthdate (bin) - interval
    • This one is a little less helpful than GEBURTSJAHR, as it only provides the head of household's birth year in 5-year bins. Given that, and that it still has significantly more missing values than the rest of the features below the 200,000 threshold, I'm going to drop it.
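Since GEBURTSJAHR is being retained despite 44% missing values, its gaps will eventually need imputing. A minimal sketch using the column median (the toy values are illustrative; a more careful approach might use model-based imputation):

```python
import numpy as np
import pandas as pd

# Toy birth-year column with gaps where the missing code 0 became NaN
years = pd.Series([1957.0, np.nan, 1979.0, np.nan, 1963.0],
                  name='GEBURTSJAHR')

# Fill gaps with the median of the observed values (here, 1963.0)
imputed = years.fillna(years.median())
```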
In [18]:
# Remove the outlier columns from the dataset

outlier_cols = ['ALTER_HH', 'KBA05_BAUMAX', 'KK_KUNDENTYP',
                'AGER_TYP', 'TITEL_KZ']
gen_pop.drop(columns=outlier_cols, inplace=True)
In [19]:
# Re-assess after dropping outlier features

missing_noOutliers = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing_noOutliers['percent missing'] = round(missing_noOutliers['total missing'] / len(gen_pop),2)
missing_noOutliers.sort_values('total missing', ascending = False, inplace=True)
missing_noOutliers
Out[19]:
total missing percent missing
GEBURTSJAHR 392318 0.44
REGIOTYP 158064 0.18
KKK 158064 0.18
W_KEIT_KIND_HH 147988 0.17
KBA05_GBZ 133324 0.15
KBA05_ANTG4 133324 0.15
KBA05_ANTG3 133324 0.15
KBA05_ANTG2 133324 0.15
KBA05_ANTG1 133324 0.15
MOBI_REGIO 133324 0.15
PLZ8_ANTG3 116515 0.13
PLZ8_ANTG4 116515 0.13
PLZ8_ANTG2 116515 0.13
PLZ8_BAUMAX 116515 0.13
PLZ8_HHZ 116515 0.13
PLZ8_GBZ 116515 0.13
PLZ8_ANTG1 116515 0.13
VERS_TYP 111196 0.12
SHOPPER_TYP 111196 0.12
HEALTH_TYP 111196 0.12
NATIONALITAET_KZ 108315 0.12
PRAEGENDE_JUGENDJAHRE 108164 0.12
KBA13_ANZAHL_PKW 105800 0.12
ANZ_HAUSHALTE_AKTIV 99611 0.11
CAMEO_INTL_2015 99352 0.11
CAMEO_DEUG_2015 99352 0.11
CAMEO_DEU_2015 99352 0.11
LP_LEBENSPHASE_FEIN 97632 0.11
RELAT_AB 97375 0.11
ARBEIT 97375 0.11
ORTSGR_KLS9 97274 0.11
ANZ_HH_TITEL 97008 0.11
LP_LEBENSPHASE_GROB 94572 0.11
INNENSTADT 93740 0.11
EWDICHTE 93740 0.11
BALLRAUM 93740 0.11
GEBAEUDETYP_RASTER 93155 0.10
WOHNLAGE 93148 0.10
OST_WEST_KZ 93148 0.10
MIN_GEBAEUDEJAHR 93148 0.10
GEBAEUDETYP 93148 0.10
LP_FAMILIE_GROB 77792 0.09
LP_FAMILIE_FEIN 77792 0.09
KONSUMNAEHE 73969 0.08
WOHNDAUER_2008 73499 0.08
ANZ_PERSONEN 73499 0.08
ANZ_TITEL 73499 0.08
SOHO_KZ 73499 0.08
HH_EINKOMMEN_SCORE 18348 0.02
RETOURTYP_BK_S 4854 0.01
GFK_URLAUBERTYP 4854 0.01
LP_STATUS_FEIN 4854 0.01
LP_STATUS_GROB 4854 0.01
CJT_GESAMTTYP 4854 0.01
ONLINE_AFFINITAET 4854 0.01
ALTERSKATEGORIE_GROB 2881 0.00
ANREDE_KZ 0 0.00
SEMIO_TRADV 0 0.00
FINANZ_MINIMALIST 0 0.00
FINANZ_SPARER 0 0.00
FINANZ_VORSORGER 0 0.00
FINANZ_ANLEGER 0 0.00
FINANZ_UNAUFFAELLIGER 0 0.00
FINANZ_HAUSBAUER 0 0.00
FINANZTYP 0 0.00
GREEN_AVANTGARDE 0 0.00
SEMIO_SOZ 0 0.00
SEMIO_FAM 0 0.00
SEMIO_REL 0 0.00
SEMIO_MAT 0 0.00
SEMIO_VERT 0 0.00
SEMIO_LUST 0 0.00
SEMIO_ERL 0 0.00
SEMIO_KULT 0 0.00
SEMIO_RAT 0 0.00
SEMIO_KRIT 0 0.00
SEMIO_DOM 0 0.00
SEMIO_KAEM 0 0.00
SEMIO_PFLICHT 0 0.00
ZABEOTYP 0 0.00
In [20]:
# What does missing count distribution look like now?

sns.distplot(gen_pop.isnull().sum(axis=0)/len(gen_pop),
            bins = 50)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1b52eac8>
In [21]:
# First things first: if I'm going to make sense out of these, I need more helpful names

# Read in the names mapping CSV as a dict
new_names = pd.read_csv('col_renaming.csv', header=None, index_col=0,
                        squeeze=True).to_dict()

gen_pop.rename(columns=new_names, inplace=True)
In [22]:
# Need to stop sorting by amount missing and leave in original sort order
# so I can more easily compare to the data dictionary file

missing_noSort = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing_noSort['percent missing'] = round(missing_noSort['total missing'] / len(gen_pop),2)

missing_noSort[missing_noSort['total missing'] < 50000]
#missing_noSort[missing_noSort['total missing'] > 50000]
Out[22]:
total missing percent missing
Age Bin 2881 0.00
Gender 0 0.00
Consumption Channel Type 4854 0.01
MoneyType__Minimalist 0 0.00
MoneyType__Saver 0 0.00
MoneyType__Preparer 0 0.00
MoneyType__Investor 0 0.00
MoneyType__Inconspicuous 0 0.00
MoneyType__Homeowner 0 0.00
MoneyType__Primary 0 0.00
Vacation Habits 4854 0.01
Young Environmentalist 0 0.00
Socioeconomic Status - HighRes 4854 0.01
Socioeconomic Status - LowRes 4854 0.01
Customer Type 4854 0.01
PersonalityType__Social 0 0.00
PersonalityType__Family 0 0.00
PersonalityType__Religious 0 0.00
PersonalityType__Materialist 0 0.00
PersonalityType__Dreamer 0 0.00
PersonalityType__Sensual 0 0.00
PersonalityType__Event 0 0.00
PersonalityType__Cultured 0 0.00
PersonalityType__Rational 0 0.00
PersonalityType__Critical 0 0.00
PersonalityType__Dominant 0 0.00
PersonalityType__Combative 0 0.00
PersonalityType__Dutiful 0 0.00
PersonalityType__Traditional 0 0.00
Energy Consumption Type 0 0.00
HH: Net Income Bins 18348 0.02
RR1: Online Affinity 4854 0.01
In [23]:
#missing_noSort[missing_noSort['total missing'] < 50000]
missing_noSort[missing_noSort['total missing'] > 50000]
Out[23]:
total missing percent missing
Birth Year 392318 0.44
Health Type 111196 0.12
Life Stage - HighRes 97632 0.11
Life Stage - LowRes 94572 0.11
Family Type - HighRes 77792 0.09
Family Type - LowRes 77792 0.09
Nationality Based on Name 108315 0.12
Generation Designation 108164 0.12
Shopper Type 111196 0.12
Small or Home Office Owner 73499 0.08
Insurance Type 111196 0.12
HH: Number of Adults in HH 73499 0.08
HH: Number of Academic Title Holders in HH 73499 0.08
HH: Probability of Children in Residence 147988 0.17
HH: Length of Residency (bins) 73499 0.08
Bldg: Number of HHs 99611 0.11
Bldg: Number of Academic Title Holders 97008 0.11
Bldg: Building Type 93148 0.10
Bldg: Distance to Point of Sale Category 73969 0.08
Bldg: Year of Building's Initial Database Entry 93148 0.10
Bldg: Location Relative to E or W Germany 93148 0.10
Bldg: Neighborhood Quality 93148 0.10
RR4: Life Stage Type - LowRes 99352 0.11
RR4: Life Stage Type - HighRes 99352 0.11
RR4: Life Stage Type - Int'l Code Mapping 99352 0.11
RR3: Bins of 1-2 Family Homes 133324 0.15
RR3: Bins of 3-5 Family Homes 133324 0.15
RR3: Bins of 6-10 Family Homes 133324 0.15
RR3: Bins of 10+ Family Homes 133324 0.15
RR3: Bin Counts of Bldgs 133324 0.15
PostCode: Distance to Nearest Urban Center (bins) 93740 0.11
PostCode: Density of HHs per km^2 (bins) 93740 0.11
PostCode: Distance to City Center (bins) 93740 0.11
RR1: Residential-Commercial Activity Ratio (categories) 93155 0.10
RR1: Purchasing Power (bins) 158064 0.18
RR1: Movement Patterns 133324 0.15
RR1: Neighborhood Type 158064 0.18
PLZ8: Number of Cars 105800 0.12
PLZ8: Bins of 1-2 Family Homes 116515 0.13
PLZ8: Bins of 3-5 Family Homes 116515 0.13
PLZ8: Bins of 6-10 Family Homes 116515 0.13
PLZ8: Bins of 10+ Family Homes 116515 0.13
PLZ8: Most Common Bldg Type 116515 0.13
PLZ8: Bin Counts of HHs 116515 0.13
PLZ8: Bin Counts of Bldgs 116515 0.13
Community: Share of Unemployment 97375 0.11
Community: Size (bins) 97274 0.11
Community: Share of Unemployment Relative to Parent County 97375 0.11

Discussion 1.1.2: Assess Missing Data in Each Column

Some notes on patterns I noticed during my exploration of the missing data as a function of column/feature:

  1. Before matching codes for unknown/missing values and filling in those values with np.nan, we had 6.4% of all values reported as missing. Afterwards, 10.6% of all values were missing.
  2. I removed features with more than 50% missing values as they appeared to be outliers in terms of the amount of missing values relative to the overall feature set and for the reasons discussed above.
  3. In the remaining data, I see clear breakpoints between features with 0% - 2% missing values, those with 8% - 18% missing, and the two features with 35% and 44% missing values.
    • Those in the group with 0% - 2% missing:
      • There are 32 features included in this subset/sample
      • Of those 32, all but two (RR1: Online Affinity and HH: Net Income Bins) are person-level measures (30 in total). Given that ~50% of all features are person-level, this suggests that person-level features are overrepresented in the subset (if the information levels were distributed the same in this sample as they are in the total population, we'd expect only 16-17 features to be person-level).
        • Also, this means that 70% of the total 43 possible person-level features in the full feature space are represented in this sample.
      • Given that both Head of Household birthdate and individual birth year had very high missing data counts (the former was dropped as an outlier), it's surprising to see Age Bin with such a relatively low number of missing values. Most likely this is due to the method used for determining an individual's age: the Age Bin values cover 15-30 year chunks of a person's life, whereas the birth year and head of household birth date features had resolutions of 1 and 5 years, respectively.
      • It's also interesting to note that Consumption Channel Type, Vacation Habits, Customer Type, Socioeconomic Status - HighRes, Socioeconomic Status - LowRes, and Online Affinity all have the exact same number of missing values. This suggests to me that they all came from the same data source and likely used the same methodology for their derivation, as this pattern seems a little too consistent to be random.
    • Those in the group with 8% - 18% missing:
      • There are 47 features included in this sample/subset
      • Of those 47, 10 (21%) are person-level features, suggesting that missing/unknown data are much more common at the higher-level resolutions than at the person-level.
        • This strikes me as odd, given especially that many of those person-level feature values were likely not directly measured, but rather derived from higher-level datasets (e.g. statistically modeled from census data).
      • It seems particularly odd that so many of the features that have higher missing value rates seem to be highly measurable quantities (e.g. from census/GIS data, such as the number of inhabitants in a community).
    • I already discussed my thoughts on those with 35% and 44% missing.
      • I retained the 44% missing birth year feature, as it seems to provide significant information useful for later analysis
      • One other thought: these are age-related variables, one at the person-level and one at the household-level. It seems odd that the Age Bin feature falls in the lowest missing value count group while these fall in the highest. Their higher resolution alone shouldn't produce such a gap, so they are presumably drawn from very different data sources than Age Bin. Strange.
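The three bands described above can also be recovered programmatically. A sketch using pd.cut on per-feature missing fractions (the toy fractions and bin edges are illustrative, chosen to match the breakpoints noted above):

```python
import pandas as pd

# Toy per-feature missing fractions standing in for the real table
pct = pd.Series({'Age Bin': 0.00,
                 'RR1: Online Affinity': 0.01,
                 'RR1: Purchasing Power (bins)': 0.18,
                 'Birth Year': 0.44})

# Bin edges chosen around the observed breakpoints
bands = pd.cut(pct, bins=[-0.01, 0.02, 0.20, 0.50],
               labels=['0-2%', '8-18%', '35-44%'])
print(bands)
```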

Step 1.1.3: Assess Missing Data in Each Row

Now, you'll perform a similar assessment for the rows of the dataset. How much data is missing in each row? As with the columns, you should see some groups of points that have a very different number of missing values. Divide the data into two subsets: one for data points that are above some threshold for missing values, and a second subset for points below that threshold.

In order to know what to do with the outlier rows, we should see if the distribution of data values on columns that are not missing data (or are missing very little data) are similar or different between the two groups. Select at least five of these columns and compare the distribution of values.

  • You can use seaborn's countplot() function to create a bar chart of code frequencies and matplotlib's subplot() function to put bar charts for the two subplots side by side.
  • To reduce repeated code, you might want to write a function that can perform this comparison, taking as one of its arguments a column to be compared.

Depending on what you observe in your comparison, this will have implications on how you approach your conclusions later in the analysis. If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should make a note on those data as special. We'll revisit these data later on. Either way, you should continue your analysis for now using just the subset of the data with few or no missing values.
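The suggested countplot-plus-subplots comparison can be sketched as a helper like the following (the function name, figure size, and threshold argument are illustrative; the cells below use a distplot-based variant instead):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def compare_counts(df, threshold, column):
    """Side-by-side countplots of `column` for rows with fewer vs.
    at least `threshold` missing values."""
    n_missing = df.isnull().sum(axis=1)
    fig, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)
    sns.countplot(x=df.loc[n_missing < threshold, column].dropna(), ax=axes[0])
    axes[0].set_title(f'< {threshold} missing per row')
    sns.countplot(x=df.loc[n_missing >= threshold, column].dropna(), ax=axes[1])
    axes[1].set_title(f'>= {threshold} missing per row')
    return fig
```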

In [24]:
# How many duplicate rows are there?

dups = len(gen_pop) - len(gen_pop.drop_duplicates())

print(f"There are {dups} duplicate rows in the dataset, \
representing {round(dups/len(gen_pop)*100,2)}% of all samples.")
There are 52517 duplicate rows in the dataset, representing 5.89% of all samples.

This is a non-trivial fraction of data that are duplicates. However, since we don't have person-level IDs for each record, it's impossible to tell if these are truly duplicates or merely extremely similar people. As such, we'll leave these alone.
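Before deciding to leave the duplicates in place, they can at least be inspected; a small sketch of pandas' duplicated() on toy data (the column names are illustrative):

```python
import pandas as pd

# Toy records with no person-level ID; rows 0 and 1 are identical
toy = pd.DataFrame({'age_bin': [2, 2, 3], 'gender': [1, 1, 2]})

n_dups = toy.duplicated().sum()               # rows identical to an earlier row
dup_groups = toy[toy.duplicated(keep=False)]  # every member of a duplicate group
print(n_dups, len(dup_groups))  # 1 2
```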

In [25]:
# What do the missing data counts look like now, by row?

rows_oneMissing_filled = len(gen_pop) - len(gen_pop.dropna(how='any'))
rows_allMissing_filled = len(gen_pop) - len(gen_pop.dropna(how='all'))

print(f"There are {rows_oneMissing_filled} samples with at least one feature missing a value \
\nand {rows_allMissing_filled} samples missing values for all features")
There are 503851 samples with at least one feature missing a value 
and 0 samples missing values for all features
In [26]:
# How much data is missing in each row of the dataset?

fig, ax = plt.subplots()
fig.suptitle('Fraction of Different Amounts of Missing Values Across Rows')

sns.distplot(gen_pop.isnull().sum(axis=1), kde=False, norm_hist=True, ax=ax)
ax.set(xlabel='Number of Columns with Missing Values')
Out[26]:
[Text(0.5, 0, 'Number of Columns with Missing Values')]
In [27]:
# How much data is missing in each row of the dataset?

fig, ax = plt.subplots()
fig.suptitle('Fraction of Different Amounts of Missing Values Across Rows')

sns.distplot(gen_pop.isnull().sum(axis=1), kde=False, norm_hist=True, ax=ax)
ax.set(xlabel='Number of Columns with Missing Values', xlim=(0,5))
Out[27]:
[(0, 5), Text(0.5, 0, 'Number of Columns with Missing Values')]

Well, it looks like a pretty safe bet to say that we should subset the data into rows with up to one missing value and rows with 2+ missing values, since the former appears to comprise around 70% of the data.

In [28]:
# Write code to divide the data into two subsets based on the number of missing
# values in each row.

gen_pop_lowMissing = gen_pop.loc[gen_pop.isnull().sum(axis=1) < 2,:]
gen_pop_highMissing = gen_pop.loc[gen_pop.isnull().sum(axis=1) >= 2,:]


# Let's just check that our numbers work out as we expect them to
print(f"There are {len(gen_pop_lowMissing)} rows that have one or fewer missing values")
print(f"There are {len(gen_pop_highMissing)} rows that have 2+ missing values")
print(f"There are {len(gen_pop_lowMissing) + len(gen_pop_highMissing)} \
rows when you combine these two")
print(f"\nThere are {len(gen_pop)} rows in the full dataset")
There are 629236 rows that have one or fewer missing values
There are 261985 rows that have 2+ missing values
There are 891221 rows when you combine these two

There are 891221 rows in the full dataset
In [29]:
# Compare the distribution of values for at least five columns where there are
# no or few missing values, between the two subsets.

def dist_compare(df, missing_count, column):
    '''
    Compares the distribution of values for a single feature 
    between two subsets of a DataFrame, subsetting based upon the number of missing values
    allowed per row (e.g. one distribution is only of rows with less than X missing values,
    the other is only of rows with at least X missing values).
    
    Inputs
    ------
    df: pandas DataFrame in long format
        
    missing_count: int. Number of missing values to use as breakpoint for subsetting df
    
    column: str. Column name of feature from the DataFrames that you're investigating.
    
    
    Returns
    -------
    None. Displays a matplotlib figure comparing the two distributions.
    '''

    fig, (ax_lowMissing, ax_highMissing) = plt.subplots(ncols=2, figsize=(20,5))
    fig.suptitle(f'Distributions of Mostly Complete Features - {column}\
    \nMissing Values Breakpoint = {missing_count}', 
                 fontsize=15, y=1.2)

    
    sns.distplot(df.loc[df.isnull().sum(axis=1) < missing_count, column].dropna(), 
                 ax=ax_lowMissing)
    ax_lowMissing.set(title=f'Rows with Fewer \nMissing Values')

    sns.distplot(df.loc[df.isnull().sum(axis=1) >= missing_count, column].dropna(), 
                 ax=ax_highMissing)
    ax_highMissing.set(title=f'Rows with More \nMissing Values')
In [30]:
# The features that are completely or almost completely filled
# and likely to show more than just two values in the distribution

full_cols = ['Energy Consumption Type', 'MoneyType__Primary',
            'Age Bin', 'Customer Type', 'RR1: Online Affinity',
            'Vacation Habits']
In [31]:
missing_values_breakpoint = 2
In [32]:
dist_compare(gen_pop, missing_values_breakpoint, 'Energy Consumption Type')

It's clear from these plots of Energy Consumption Type that, while both subsets seem to have a majority of people with an energy consumption pattern of 3 ("fair supplied"), this majority is a larger fraction of the total subset for the data with a higher missing value count.

In [33]:
dist_compare(gen_pop, missing_values_breakpoint, 'MoneyType__Primary')

Much like with Energy Consumption Type, there seems to be a dominant feature value for MoneyType__Primary when you look at the subset with a high missing value count. In this case, it's code 4: "be prepared". There's also a clearly dominant code for the low missing values data: code 6 ("inconspicuous").

In [34]:
dist_compare(gen_pop, missing_values_breakpoint, 'Age Bin')

This one is a little more subtle than the preceding two. For the most part, the two subset distributions for Age Bin look the same, with the only notable difference being that code 4 (> 60 years old) is a smaller fraction of the overall distribution for the high missing values subset than it is for the low missing values one.

In [35]:
dist_compare(gen_pop, missing_values_breakpoint, 'Customer Type')

As with Age Bin, Customer Type has a more subtle distributional difference between the two subsets. That said, it looks like it would be fair to say that code 5 ("determined Minimal-Returner") is the dominant feature for the low missing values subset and code 3 ("incentive-receptive Normal-Returner") for the high missing values subset.

In [36]:
dist_compare(gen_pop, missing_values_breakpoint, 'RR1: Online Affinity')

For RR1: Online Affinity, there can be no doubt: code 2 ("middle") is the dominant feature for the high missing values subset, and the codes are pretty uniformly distributed for the low missing values subset.

In [37]:
dist_compare(gen_pop, missing_values_breakpoint, 'Vacation Habits')

For Vacation Habits, the pattern is clear: code 5 (nature fans) is the dominant feature for the high missing values subset and the distribution is pretty uniform across codes (although with a dip for codes 6 and 7 and an interesting overall positive slope) for the low missing values subset.

Discussion 1.1.3: Assess Missing Data in Each Row

Ultimately, the distributions are more uniform for the rows with low missing value counts than for the rows with high missing value counts. It's important to note that, at least for the features we spot-checked, the two subsets' distributions are not similar enough to claim that excluding the rows with more missing values loses no information.
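The visual spot-checks could be complemented with a numeric summary. One option (an illustrative sketch, not part of the project rubric) is the total variation distance between the two subsets' value distributions:

```python
import pandas as pd

def dist_difference(df, threshold, column):
    """Total variation distance between the value distributions of `column`
    in the low- and high-missing row subsets (0 = identical, 1 = disjoint)."""
    n_missing = df.isnull().sum(axis=1)
    low = df.loc[n_missing < threshold, column].value_counts(normalize=True)
    high = df.loc[n_missing >= threshold, column].value_counts(normalize=True)
    aligned = pd.concat([low, high], axis=1).fillna(0)
    return 0.5 * (aligned.iloc[:, 0] - aligned.iloc[:, 1]).abs().sum()
```

Values near 0 would support dropping the high-missing rows outright; values near 1 would flag them as a distinct population.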

Step 1.2: Select and Re-Encode Features

Checking for missing data isn't the only way in which you can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, you need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.

  • For numeric and interval data, these features can be kept without changes.
  • Most of the variables in the dataset are ordinal in nature. While ordinal values may technically be non-linear in spacing, make the simplifying assumption that the ordinal variables can be treated as being interval in nature (that is, kept without any changes).
  • Special handling may be necessary for the remaining two variable types: categorical, and 'mixed'.
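The handling rules above amount to splitting attributes by feat_info's type column; a minimal sketch (toy feat_info rows, same schema as the real one):

```python
import pandas as pd

# Toy slice of feat_info's attribute/type columns
feat_info_toy = pd.DataFrame({
    'attribute': ['ALTERSKATEGORIE_GROB', 'ANREDE_KZ',
                  'GEBURTSJAHR', 'PRAEGENDE_JUGENDJAHRE'],
    'type': ['ordinal', 'categorical', 'numeric', 'mixed'],
})

# numeric/interval/ordinal: keep as-is; categorical/mixed: need re-encoding
keep_as_is = feat_info_toy.loc[
    feat_info_toy['type'].isin(['numeric', 'interval', 'ordinal']),
    'attribute'].tolist()
needs_work = feat_info_toy.loc[
    feat_info_toy['type'].isin(['categorical', 'mixed']),
    'attribute'].tolist()
print(keep_as_is)  # ['ALTERSKATEGORIE_GROB', 'GEBURTSJAHR']
print(needs_work)  # ['ANREDE_KZ', 'PRAEGENDE_JUGENDJAHRE']
```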

In the first two parts of this sub-step, you will perform an investigation of the categorical and mixed-type features and make a decision on each of them, whether you will keep, drop, or re-encode each. Then, in the last part, you will create a new data frame with only the selected and engineered columns.

Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. But stick with it: once you're done with this step, you'll be ready to get to the machine learning parts of the project!

In [38]:
# Add new attribute names to feat_info

feat_info['attribute_renamed'] = pd.read_csv('col_renaming.csv', header=None)[1]
feat_info
Out[38]:
attribute information_level type missing_or_unknown attribute_renamed
0 AGER_TYP person categorical [-1, 0] Elderly Type
1 ALTERSKATEGORIE_GROB person ordinal [-1, 0, 9] Age Bin
2 ANREDE_KZ person categorical [-1, 0] Gender
3 CJT_GESAMTTYP person categorical [0] Consumption Channel Type
4 FINANZ_MINIMALIST person ordinal [-1] MoneyType__Minimalist
5 FINANZ_SPARER person ordinal [-1] MoneyType__Saver
6 FINANZ_VORSORGER person ordinal [-1] MoneyType__Preparer
7 FINANZ_ANLEGER person ordinal [-1] MoneyType__Investor
8 FINANZ_UNAUFFAELLIGER person ordinal [-1] MoneyType__Inconspicuous
9 FINANZ_HAUSBAUER person ordinal [-1] MoneyType__Homeowner
10 FINANZTYP person categorical [-1] MoneyType__Primary
11 GEBURTSJAHR person numeric [0] Birth Year
12 GFK_URLAUBERTYP person categorical [] Vacation Habits
13 GREEN_AVANTGARDE person categorical [] Young Environmentalist
14 HEALTH_TYP person ordinal [-1, 0] Health Type
15 LP_LEBENSPHASE_FEIN person mixed [0] Life Stage - HighRes
16 LP_LEBENSPHASE_GROB person mixed [0] Life Stage - LowRes
17 LP_FAMILIE_FEIN person categorical [0] Family Type - HighRes
18 LP_FAMILIE_GROB person categorical [0] Family Type - LowRes
19 LP_STATUS_FEIN person categorical [0] Socioeconomic Status - HighRes
20 LP_STATUS_GROB person categorical [0] Socioeconomic Status - LowRes
21 NATIONALITAET_KZ person categorical [-1, 0] Nationality Based on Name
22 PRAEGENDE_JUGENDJAHRE person mixed [-1, 0] Generation Designation
23 RETOURTYP_BK_S person ordinal [0] Customer Type
24 SEMIO_SOZ person ordinal [-1, 9] PersonalityType__Social
25 SEMIO_FAM person ordinal [-1, 9] PersonalityType__Family
26 SEMIO_REL person ordinal [-1, 9] PersonalityType__Religious
27 SEMIO_MAT person ordinal [-1, 9] PersonalityType__Materialist
28 SEMIO_VERT person ordinal [-1, 9] PersonalityType__Dreamer
29 SEMIO_LUST person ordinal [-1, 9] PersonalityType__Sensual
30 SEMIO_ERL person ordinal [-1, 9] PersonalityType__Event
31 SEMIO_KULT person ordinal [-1, 9] PersonalityType__Cultured
32 SEMIO_RAT person ordinal [-1, 9] PersonalityType__Rational
33 SEMIO_KRIT person ordinal [-1, 9] PersonalityType__Critical
34 SEMIO_DOM person ordinal [-1, 9] PersonalityType__Dominant
35 SEMIO_KAEM person ordinal [-1, 9] PersonalityType__Combative
36 SEMIO_PFLICHT person ordinal [-1, 9] PersonalityType__Dutiful
37 SEMIO_TRADV person ordinal [-1, 9] PersonalityType__Traditional
38 SHOPPER_TYP person categorical [-1] Shopper Type
39 SOHO_KZ person categorical [-1] Small or Home Office Owner
40 TITEL_KZ person categorical [-1, 0] Academic Title
41 VERS_TYP person categorical [-1] Insurance Type
42 ZABEOTYP person categorical [-1, 9] Energy Consumption Type
43 ALTER_HH household interval [0] HH: Head of HH Birthdate Bin (5-Year)
44 ANZ_PERSONEN household numeric [] HH: Number of Adults in HH
45 ANZ_TITEL household numeric [] HH: Number of Academic Title Holders in HH
46 HH_EINKOMMEN_SCORE household ordinal [-1, 0] HH: Net Income Bins
47 KK_KUNDENTYP household categorical [-1] HH: Customer Type
48 W_KEIT_KIND_HH household ordinal [-1, 0] HH: Probability of Children in Residence
49 WOHNDAUER_2008 household ordinal [-1, 0] HH: Length of Residency (bins)
50 ANZ_HAUSHALTE_AKTIV building numeric [0] Bldg: Number of HHs
51 ANZ_HH_TITEL building numeric [] Bldg: Number of Academic Title Holders
52 GEBAEUDETYP building categorical [-1, 0] Bldg: Building Type
53 KONSUMNAEHE building ordinal [] Bldg: Distance to Point of Sale Category
54 MIN_GEBAEUDEJAHR building numeric [0] Bldg: Year of Building's Initial Database Entry
55 OST_WEST_KZ building categorical [-1] Bldg: Location Relative to E or W Germany
56 WOHNLAGE building mixed [-1] Bldg: Neighborhood Quality
57 CAMEO_DEUG_2015 microcell_rr4 categorical [-1, X] RR4: Life Stage Type - LowRes
58 CAMEO_DEU_2015 microcell_rr4 categorical [XX] RR4: Life Stage Type - HighRes
59 CAMEO_INTL_2015 microcell_rr4 mixed [-1, XX] RR4: Life Stage Type - Int'l Code Mapping
60 KBA05_ANTG1 microcell_rr3 ordinal [-1] RR3: Bins of 1-2 Family Homes
61 KBA05_ANTG2 microcell_rr3 ordinal [-1] RR3: Bins of 3-5 Family Homes
62 KBA05_ANTG3 microcell_rr3 ordinal [-1] RR3: Bins of 6-10 Family Homes
63 KBA05_ANTG4 microcell_rr3 ordinal [-1] RR3: Bins of 10+ Family Homes
64 KBA05_BAUMAX microcell_rr3 mixed [-1, 0] RR3: Most Common Bldg Type
65 KBA05_GBZ microcell_rr3 ordinal [-1, 0] RR3: Bin Counts of Bldgs
66 BALLRAUM postcode ordinal [-1] PostCode: Distance to Nearest Urban Center (bins)
67 EWDICHTE postcode ordinal [-1] PostCode: Density of HHs per km^2 (bins)
68 INNENSTADT postcode ordinal [-1] PostCode: Distance to City Center (bins)
69 GEBAEUDETYP_RASTER region_rr1 ordinal [] RR1: Residential-Commercial Activity Ratio (ca...
70 KKK region_rr1 ordinal [-1, 0] RR1: Purchasing Power (bins)
71 MOBI_REGIO region_rr1 ordinal [] RR1: Movement Patterns
72 ONLINE_AFFINITAET region_rr1 ordinal [] RR1: Online Affinity
73 REGIOTYP region_rr1 ordinal [-1, 0] RR1: Neighborhood Type
74 KBA13_ANZAHL_PKW macrocell_plz8 numeric [] PLZ8: Number of Cars
75 PLZ8_ANTG1 macrocell_plz8 ordinal [-1] PLZ8: Bins of 1-2 Family Homes
76 PLZ8_ANTG2 macrocell_plz8 ordinal [-1] PLZ8: Bins of 3-5 Family Homes
77 PLZ8_ANTG3 macrocell_plz8 ordinal [-1] PLZ8: Bins of 6-10 Family Homes
78 PLZ8_ANTG4 macrocell_plz8 ordinal [-1] PLZ8: Bins of 10+ Family Homes
79 PLZ8_BAUMAX macrocell_plz8 mixed [-1, 0] PLZ8: Most Common Bldg Type
80 PLZ8_HHZ macrocell_plz8 ordinal [-1] PLZ8: Bin Counts of HHs
81 PLZ8_GBZ macrocell_plz8 ordinal [-1] PLZ8: Bin Counts of Bldgs
82 ARBEIT community ordinal [-1, 9] Community: Share of Unemployment
83 ORTSGR_KLS9 community ordinal [-1, 0] Community: Size (bins)
84 RELAT_AB community ordinal [-1, 9] Community: Share of Unemployment Relative to P...
In [39]:
# How many features are there of each data type?

# I'm only looking at the features that still remain in gen_pop after dropping outlier features
feat_info_remaining = feat_info.loc[feat_info['attribute_renamed'].isin(gen_pop.columns), :]
feat_info_remaining['type'].value_counts()
Out[39]:
ordinal        49
categorical    18
numeric         7
mixed           6
Name: type, dtype: int64

Step 1.2.1: Re-Encode Categorical Features

For categorical data, you would ordinarily need to encode the levels as dummy variables. Depending on the number of categories, perform one of the following:

  • For binary (two-level) categoricals that take numeric values, you can keep them without needing to do anything.
  • There is one binary variable that takes on non-numeric values. For this one, you need to re-encode the values as numbers or create a dummy variable.
  • For multi-level categoricals (three or more values), you can choose to encode the values using multiple dummy variables (e.g. via OneHotEncoder), or (to keep things straightforward) just drop them from the analysis. As always, document your choices in the Discussion section.
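For the one non-numeric binary (OST_WEST_KZ), a simple map is one alternative to dummies; a sketch (the 0/1 assignment is arbitrary, and missing values simply stay NaN):

```python
import pandas as pd

# Toy slice of the 'W'/'O' (West/East) flag, with one missing value
s = pd.Series(['W', 'O', 'W', None], name='OST_WEST_KZ')

encoded = s.map({'W': 1, 'O': 0})  # NaN stays NaN
print(encoded.tolist())
```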
In [40]:
# Assess categorical variables: which are binary, which are multi-level, and
# which one needs to be re-encoded?

feat_info_remaining[feat_info_remaining['type'] == 'categorical']
Out[40]:
attribute information_level type missing_or_unknown attribute_renamed
2 ANREDE_KZ person categorical [-1, 0] Gender
3 CJT_GESAMTTYP person categorical [0] Consumption Channel Type
10 FINANZTYP person categorical [-1] MoneyType__Primary
12 GFK_URLAUBERTYP person categorical [] Vacation Habits
13 GREEN_AVANTGARDE person categorical [] Young Environmentalist
17 LP_FAMILIE_FEIN person categorical [0] Family Type - HighRes
18 LP_FAMILIE_GROB person categorical [0] Family Type - LowRes
19 LP_STATUS_FEIN person categorical [0] Socioeconomic Status - HighRes
20 LP_STATUS_GROB person categorical [0] Socioeconomic Status - LowRes
21 NATIONALITAET_KZ person categorical [-1, 0] Nationality Based on Name
38 SHOPPER_TYP person categorical [-1] Shopper Type
39 SOHO_KZ person categorical [-1] Small or Home Office Owner
41 VERS_TYP person categorical [-1] Insurance Type
42 ZABEOTYP person categorical [-1, 9] Energy Consumption Type
52 GEBAEUDETYP building categorical [-1, 0] Bldg: Building Type
55 OST_WEST_KZ building categorical [-1] Bldg: Location Relative to E or W Germany
57 CAMEO_DEUG_2015 microcell_rr4 categorical [-1, X] RR4: Life Stage Type - LowRes
58 CAMEO_DEU_2015 microcell_rr4 categorical [XX] RR4: Life Stage Type - HighRes
In [41]:
# Figure out what different values are available for each feature

cols = feat_info_remaining.loc[feat_info_remaining['type'] == 'categorical', 'attribute_renamed']

unique_cat_vals = pd.DataFrame()

for col in cols:
    temp = pd.DataFrame({'attribute': col, 'unique values': gen_pop[col].unique()})
    unique_cat_vals = pd.concat([unique_cat_vals, temp])
    
unique_cat_vals.groupby(['attribute']).count().sort_values('unique values')
Out[41]:
unique values
attribute
Young Environmentalist 2
Bldg: Location Relative to E or W Germany 2
Small or Home Office Owner 2
Gender 2
Insurance Type 2
Nationality Based on Name 3
Shopper Type 4
Socioeconomic Status - LowRes 5
Family Type - LowRes 5
MoneyType__Primary 6
Energy Consumption Type 6
Consumption Channel Type 6
Bldg: Building Type 7
RR4: Life Stage Type - LowRes 9
Socioeconomic Status - HighRes 10
Family Type - HighRes 11
Vacation Habits 12
RR4: Life Stage Type - HighRes 44
In [42]:
# What are the values taken on by the binary features?

binary_cols = unique_cat_vals.groupby(['attribute']).count().sort_values('unique values').index[:5]
unique_cat_vals[unique_cat_vals['attribute'].isin(binary_cols)]
Out[42]:
attribute unique values
0 Gender 1
1 Gender 2
0 Young Environmentalist 0
1 Young Environmentalist 1
0 Small or Home Office Owner NaN
1 Small or Home Office Owner 1
2 Small or Home Office Owner 0
0 Insurance Type NaN
1 Insurance Type 2
2 Insurance Type 1
0 Bldg: Location Relative to E or W Germany NaN
1 Bldg: Location Relative to E or W Germany W
2 Bldg: Location Relative to E or W Germany O

The following features appear to be binary in nature:

  1. Gender
  2. Young Environmentalist
  3. Bldg: Location Relative to E or W Germany
    • This one takes on values of 'W' and 'O' for West and East Germany, respectively
    • Given its non-numeric nature, we need to use dummy variables with this
  4. Small or Home Office Owner
    • This one actually has NaN values, which means I need to use dummy variables with it too (so NaN can be reflected by 0,0)
  5. Insurance Type
    • This one also has NaN values, so I'll need to use a dummy with it too!

That leaves the following features for re-encoding through dummies:

  1. Nationality Based on Name
  2. Shopper Type
  3. Socioeconomic Status - LowRes
  4. Family Type - LowRes
  5. MoneyType__Primary
  6. Energy Consumption Type
  7. Consumption Channel Type
  8. Bldg: Building Type
  9. RR4: Life Stage Type - LowRes
  10. Socioeconomic Status - HighRes
  11. Family Type - HighRes
  12. Vacation Habits
  13. RR4: Life Stage Type - HighRes
In [43]:
# Re-encode categorical variable(s) to be kept in the analysis.

cat_cols = ['Bldg: Location Relative to E or W Germany',
           'Small or Home Office Owner', 'Insurance Type',
           'Nationality Based on Name', 'Shopper Type',
            'Socioeconomic Status - LowRes', 'Family Type - LowRes',
            'MoneyType__Primary', 'Energy Consumption Type',
            'Consumption Channel Type', 'Bldg: Building Type',
            'RR4: Life Stage Type - LowRes', 'Socioeconomic Status - HighRes',
            'Family Type - HighRes', 'Vacation Habits', 
            'RR4: Life Stage Type - HighRes']

# Have to drop first dummy to avoid the Dummy Variable Trap
# Have to include NaN dummy so zero-vector can't be interpreted ambiguously
    # as either NaN or first dropped dummy (which are likely different)
gen_pop_lowMissing = pd.get_dummies(
    gen_pop_lowMissing, prefix_sep='__', 
    drop_first=True, dummy_na=True,
    columns=cat_cols)

gen_pop_lowMissing.columns
Out[43]:
Index(['Age Bin', 'Gender', 'MoneyType__Minimalist', 'MoneyType__Saver',
       'MoneyType__Preparer', 'MoneyType__Investor',
       'MoneyType__Inconspicuous', 'MoneyType__Homeowner', 'Birth Year',
       'Young Environmentalist',
       ...
       'RR4: Life Stage Type - HighRes__8A',
       'RR4: Life Stage Type - HighRes__8B',
       'RR4: Life Stage Type - HighRes__8C',
       'RR4: Life Stage Type - HighRes__8D',
       'RR4: Life Stage Type - HighRes__9A',
       'RR4: Life Stage Type - HighRes__9B',
       'RR4: Life Stage Type - HighRes__9C',
       'RR4: Life Stage Type - HighRes__9D',
       'RR4: Life Stage Type - HighRes__9E',
       'RR4: Life Stage Type - HighRes__nan'],
      dtype='object', length=198)
In [44]:
# What does our dataframe look like now?

gen_pop_lowMissing.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 629236 entries, 1 to 891220
Columns: 198 entries, Age Bin to RR4: Life Stage Type - HighRes__nan
dtypes: float64(41), int64(22), object(1), uint8(134)
memory usage: 392.5+ MB
In [45]:
gen_pop_lowMissing.dtypes.value_counts()
Out[45]:
uint8      134
float64     41
int64       22
object       1
dtype: int64
In [46]:
gen_pop_lowMissing.columns[gen_pop_lowMissing.dtypes == 'object']
Out[46]:
Index(['RR4: Life Stage Type - Int'l Code Mapping'], dtype='object')

Discussion 1.2.1: Re-Encode Categorical Features

I treated these features as binary and not requiring further engineering:

  1. Gender
  2. Young Environmentalist

I treated these features as binary but requiring dummy variables due to their encoding or due to the presence of NaN values:

  1. Bldg: Location Relative to E or W Germany
  2. Small or Home Office Owner
  3. Insurance Type

And I used dummies to encode these multi-level categoricals:

  1. Nationality Based on Name
  2. Shopper Type
  3. Socioeconomic Status - LowRes
  4. Family Type - LowRes
  5. MoneyType__Primary
  6. Energy Consumption Type
  7. Consumption Channel Type
  8. Bldg: Building Type
  9. RR4: Life Stage Type - LowRes
  10. Socioeconomic Status - HighRes
  11. Family Type - HighRes
  12. Vacation Habits
  13. RR4: Life Stage Type - HighRes

Step 1.2.2: Engineer Mixed-Type Features

There are a handful of features marked as "mixed" in the feature summary that require special treatment in order to be included in the analysis. Two in particular deserve attention; the handling of the rest is up to your own choices:

  • "PRAEGENDE_JUGENDJAHRE" combines information on three dimensions: generation by decade, movement (mainstream vs. avantgarde), and nation (east vs. west). While there aren't enough levels to disentangle east from west, you should create two new variables to capture the other two dimensions: an interval-type variable for decade, and a binary variable for movement.
  • "CAMEO_INTL_2015" combines information on two axes: wealth and life stage. Break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).
  • If you decide to keep or engineer new features around the other mixed-type features, make sure you note your steps in the Discussion section.

Be sure to check Data_Dictionary.md for the details needed to finish these tasks.
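
The digit split described above is just integer division and modulo on the two-digit codes. A quick sketch with made-up code values (the actual feature engineering happens in the cells below):

```python
import pandas as pd

# Made-up two-digit codes of the CAMEO_INTL_2015 form
codes = pd.Series([14, 32, 55])

wealth = codes // 10   # 'tens' place -> wealth code
stage = codes % 10     # 'ones' place -> life stage code

print(wealth.tolist(), stage.tolist())
# [1, 3, 5] [4, 2, 5]
```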

Mapping for Generation Designation Re-Encoding:

| Original Code | Original Code Meaning | Decade Code | Movement Code |
| --- | --- | --- | --- |
| 1 | 40s - war years (Mainstream, E+W) | 0 | 0 |
| 2 | 40s - reconstruction years (Avantgarde, E+W) | 0 | 1 |
| 3 | 50s - economic miracle (Mainstream, E+W) | 1 | 0 |
| 4 | 50s - milk bar / Individualisation (Avantgarde, E+W) | 1 | 1 |
| 5 | 60s - economic miracle (Mainstream, E+W) | 2 | 0 |
| 6 | 60s - generation 68 / student protestors (Avantgarde, W) | 2 | 1 |
| 7 | 60s - opponents to the building of the Wall (Avantgarde, E) | 2 | 1 |
| 8 | 70s - family orientation (Mainstream, E+W) | 3 | 0 |
| 9 | 70s - peace movement (Avantgarde, E+W) | 3 | 1 |
| 10 | 80s - Generation Golf (Mainstream, W) | 4 | 0 |
| 11 | 80s - ecological awareness (Avantgarde, W) | 4 | 1 |
| 12 | 80s - FDJ / communist party youth organisation (Mainstream, E) | 4 | 0 |
| 13 | 80s - Swords into ploughshares (Avantgarde, E) | 4 | 1 |
| 14 | 90s - digital media kids (Mainstream, E+W) | 5 | 0 |
| 15 | 90s - ecological awareness (Avantgarde, E+W) | 5 | 1 |
In [47]:
def data_mapper(series, mapping_dict):
    '''
    Re-encodes a pandas Series of categorical codes according to mapping_dict,
    printing null counts before and after so that values silently dropped by
    the mapping are easy to spot. (Written for the Generation Designation
    feature, but reused for other code re-encodings below.)

    Inputs
    ------
    series: pandas Series of numeric codes to re-encode (e.g. Generation
        Designation codes 1 through 15)

    mapping_dict: dict of form {designation_code: new_code} used to determine what
        values to return


    Returns
    -------
    pandas Series with the new codes
    '''
    
    # Since NaN values aren't always propagated as expected, do a quick check
    print(f"There are {series.isnull().sum()} null values in the series \
    {series.name} prior to extraction")
    
    out = series.map(mapping_dict, na_action = 'ignore')
    
    print(f"There are {out.isnull().sum()} null values in the series \
    {series.name} after extraction")
    
    return out              
In [48]:
decade_code_map = {
    1: 0,
    2: 0,
    3: 1,
    4: 1,
    5: 2,
    6: 2,
    7: 2,
    8: 3,
    9: 3,
    10: 4,
    11: 4,
    12: 4,
    13: 4,
    14: 5,
    15: 5
}
In [49]:
# Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.

# Variable 1: generation by decade
gen_pop_lowMissing['Generation Decade'] = \
    data_mapper(gen_pop_lowMissing['Generation Designation'],
                              decade_code_map)
There are 0 null values in the series     Generation Designation prior to extraction
There are 0 null values in the series     Generation Designation after extraction
In [50]:
# Does it seem like this worked?
# Should be min of 0 and max of 5 

gen_pop_lowMissing['Generation Decade'].describe()
Out[50]:
count    629236.00000
mean          3.31650
std           1.45395
min           0.00000
25%           2.00000
50%           3.00000
75%           5.00000
max           5.00000
Name: Generation Decade, dtype: float64
In [51]:
gen_pop_lowMissing['Generation Decade'].name
Out[51]:
'Generation Decade'
In [52]:
movement_code_map = {
    1: 0,
    2: 1,
    3: 0,
    4: 1,
    5: 0,
    6: 1,
    7: 1,
    8: 0,
    9: 1,
    10: 0,
    11: 1,
    12: 0,
    13: 1,
    14: 0,
    15: 1
}
In [53]:
# Feature: "PRAEGENDE_JUGENDJAHRE"

# Variable 2: movement type
gen_pop_lowMissing['Generation Movement'] = \
    data_mapper(gen_pop_lowMissing['Generation Designation'],
                              movement_code_map)
There are 0 null values in the series     Generation Designation prior to extraction
There are 0 null values in the series     Generation Designation after extraction
In [54]:
# Does it seem like this worked?
# Should be min of 0 and max of 1

gen_pop_lowMissing['Generation Movement'].describe()
Out[54]:
count    629236.000000
mean          0.233882
std           0.423298
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: Generation Movement, dtype: float64
In [55]:
# Drop Generation Designation feature now that we've captured its information in the new features
gen_pop_lowMissing.drop(columns='Generation Designation', inplace=True)

Mapping for RR4: Life Stage Type - Int'l Code Mapping Re-Encoding:

| Original Code | Original Code Meaning | Wealth Code | Life Stage Code |
| --- | --- | --- | --- |
| 11 | Wealthy Households - Pre-Family Couples & Singles | 1 | 1 |
| 12 | Wealthy Households - Young Couples With Children | 1 | 2 |
| 13 | Wealthy Households - Families With School Age Children | 1 | 3 |
| 14 | Wealthy Households - Older Families & Mature Couples | 1 | 4 |
| 15 | Wealthy Households - Elders In Retirement | 1 | 5 |
| 21 | Prosperous Households - Pre-Family Couples & Singles | 2 | 1 |
| 22 | Prosperous Households - Young Couples With Children | 2 | 2 |
| 23 | Prosperous Households - Families With School Age Children | 2 | 3 |
| 24 | Prosperous Households - Older Families & Mature Couples | 2 | 4 |
| 25 | Prosperous Households - Elders In Retirement | 2 | 5 |
| 31 | Comfortable Households - Pre-Family Couples & Singles | 3 | 1 |
| 32 | Comfortable Households - Young Couples With Children | 3 | 2 |
| 33 | Comfortable Households - Families With School Age Children | 3 | 3 |
| 34 | Comfortable Households - Older Families & Mature Couples | 3 | 4 |
| 35 | Comfortable Households - Elders In Retirement | 3 | 5 |
| 41 | Less Affluent Households - Pre-Family Couples & Singles | 4 | 1 |
| 42 | Less Affluent Households - Young Couples With Children | 4 | 2 |
| 43 | Less Affluent Households - Families With School Age Children | 4 | 3 |
| 44 | Less Affluent Households - Older Families & Mature Couples | 4 | 4 |
| 45 | Less Affluent Households - Elders In Retirement | 4 | 5 |
| 51 | Poorer Households - Pre-Family Couples & Singles | 5 | 1 |
| 52 | Poorer Households - Young Couples With Children | 5 | 2 |
| 53 | Poorer Households - Families With School Age Children | 5 | 3 |
| 54 | Poorer Households - Older Families & Mature Couples | 5 | 4 |
| 55 | Poorer Households - Elders In Retirement | 5 | 5 |
In [56]:
# Investigate "CAMEO_INTL_2015" and engineer two new variables.

# "CAMEO_INTL_2015" combines information on two axes: wealth and life stage. 
# Break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new 
# ordinal variables (which, for the purposes of this project, is equivalent to just treating 
# them as their raw numeric values).

# Cast as int since it seems to keep coming up as object type
gen_pop_lowMissing["RR4: Life Stage Type - Int'l Code Mapping"] =\
gen_pop_lowMissing["RR4: Life Stage Type - Int'l Code Mapping"].astype(int)

# Extract the tens digit - wealth code
gen_pop_lowMissing["RR4: Life Stage Type - Int'l - Wealth"] = \
(gen_pop_lowMissing["RR4: Life Stage Type - Int'l Code Mapping"] / 10).astype(int)

# Extract the ones digit - life stage code
gen_pop_lowMissing["RR4: Life Stage Type - Int'l - Stage"] = \
gen_pop_lowMissing["RR4: Life Stage Type - Int'l Code Mapping"] % 10
In [57]:
# Should be between 1 and 5
gen_pop_lowMissing["RR4: Life Stage Type - Int'l - Wealth"].describe()
Out[57]:
count    629236.000000
mean          3.329085
std           1.457449
min           1.000000
25%           2.000000
50%           4.000000
75%           5.000000
max           5.000000
Name: RR4: Life Stage Type - Int'l - Wealth, dtype: float64
In [58]:
# Should be between 1 and 5
gen_pop_lowMissing["RR4: Life Stage Type - Int'l - Stage"].describe()
Out[58]:
count    629236.000000
mean          2.873818
std           1.499416
min           1.000000
25%           1.000000
50%           3.000000
75%           4.000000
max           5.000000
Name: RR4: Life Stage Type - Int'l - Stage, dtype: float64
In [59]:
# Drop lingering RR4: Life Stage Type - Int'l Code Mapping feature
gen_pop_lowMissing.drop(columns="RR4: Life Stage Type - Int'l Code Mapping", inplace=True)
In [60]:
# What other features do I need to look at?

# Ignoring the two features I just worked on
feat_info_remaining[feat_info_remaining['type'] == 'mixed'].drop(labels=[22,59])
Out[60]:
attribute information_level type missing_or_unknown attribute_renamed
15 LP_LEBENSPHASE_FEIN person mixed [0] Life Stage - HighRes
16 LP_LEBENSPHASE_GROB person mixed [0] Life Stage - LowRes
56 WOHNLAGE building mixed [-1] Bldg: Neighborhood Quality
79 PLZ8_BAUMAX macrocell_plz8 mixed [-1, 0] PLZ8: Most Common Bldg Type

Since you're typically better off when modeling with more features rather than fewer (at least prior to feature selection and dimensionality reduction), especially when a single original feature bundles together several pieces of information, I'm going to opt to split these features up whenever possible.

Here are my thoughts:

  1. Life Stage - HighRes and - LowRes: lots of info, but inconsistent levels
    • These have information related to age, marital status, income level, homeownership status, and (birth?) household type, making them potentially rich sources for generating new features
    • However, the levels are very inconsistent across the codes, making it impossible to pull out the same information across all original codes
      • Even if we were willing to drop information like we did earlier for the Generation Designation feature (East vs. West, if you recall), we couldn't do that here since some values have a completely different set of levels than others do.
    • As a result of all of this, I'm going to be treating these as though they're categorical, making dummies out of all of them
  2. Bldg: Neighborhood Quality: non-obvious ordering of values
    • Bldg: Neighborhood Quality seems at first glance to be clearly ordinal, until you realize that "rural" and "new building rural" aren't inherently less than a "very poor neighborhood"
    • The plan: split out a Rural categorical (dummy) feature, and recode the rural codes in the original feature to the 'no score calculated' level. That way we retain as much of the ordinal quality information as possible along with the rural/urban split, even though we won't have any quality information on the rural buildings (which we never had to begin with!)
  3. PLZ8: Most Common Bldg Type: mixed ordering
    • If I'm interpreting this one correctly, it refers to the size of buildings (e.g. how many families can be housed within one). That would seem to me to be an indication of square footage, but since multi-family buildings are measured differently than businesses (e.g. a convenience store is much smaller than a 10-family home, but a grocery store is much larger), it's not appropriate to keep them on the same ordinal scale.
    • That being said, the other PLZ8 features already capture the prevalence of each size of multi-family home. The only new information this feature provides is whether business buildings dominate, so we'll simply reduce it to a binary feature indicating whether business buildings are the primary property type or not.
In [61]:
# First, the easy ones: dummies for Life Stage - HighRes, and Life Stage - LowRes

cat_cols = ['Life Stage - HighRes',
           'Life Stage - LowRes']

gen_pop_lowMissing = pd.get_dummies(
    gen_pop_lowMissing, prefix_sep='__', 
    drop_first=True, dummy_na=True,
    columns=cat_cols)

gen_pop_lowMissing.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 629236 entries, 1 to 891220
Columns: 250 entries, Age Bin to Life Stage - LowRes__nan
dtypes: float64(38), int64(26), uint8(186)
memory usage: 423.7 MB

Mapping for Bldg: Neighborhood Quality Re-Encoding:

| Original Code | Original Code Meaning | Neighborhood Quality Code | Rural Code |
| --- | --- | --- | --- |
| 0 | no score calculated | 0 | np.nan |
| 1 | very good neighborhood | 1 | 0 |
| 2 | good neighborhood | 2 | 0 |
| 3 | average neighborhood | 3 | 0 |
| 4 | poor neighborhood | 4 | 0 |
| 5 | very poor neighborhood | 5 | 0 |
| 7 | rural neighborhood | 0 | 1 |
| 8 | new building in rural neighborhood | 0 | 2 |
In [62]:
# Map the rural codes to a new feature

rural_code_map = {
    0: np.nan,
    1: 0,
    2: 0,
    3: 0,
    4: 0,
    5: 0,
    7: 1,
    8: 2
}

gen_pop_lowMissing['Bldg: Rural Type'] = \
    data_mapper(gen_pop_lowMissing['Bldg: Neighborhood Quality'],
                rural_code_map)
There are 0 null values in the series     Bldg: Neighborhood Quality prior to extraction
There are 4 null values in the series     Bldg: Neighborhood Quality after extraction
In [63]:
# Make dummies out of the multi-level Rural Type categorical
cat_cols = ['Bldg: Rural Type']

gen_pop_lowMissing = pd.get_dummies(gen_pop_lowMissing,
                                    prefix_sep='__',
                                    drop_first=True,
                                    dummy_na=True,
                                    columns=cat_cols)
In [64]:
# Make the rural codes effectively unscored
neighborhood_code_map = {
    0: 0,
    1: 1,
    2: 2,
    3: 3,
    4: 4,
    5: 5,
    7: 0,
    8: 0
}

gen_pop_lowMissing['Bldg: Neighborhood Quality'] = \
    data_mapper(gen_pop_lowMissing['Bldg: Neighborhood Quality'],
                              neighborhood_code_map)
There are 0 null values in the series     Bldg: Neighborhood Quality prior to extraction
There are 0 null values in the series     Bldg: Neighborhood Quality after extraction

Mapping for PLZ8: Most Common Bldg Type Re-Encoding:

| Original Code | Original Code Meaning | Business Bldg Code |
| --- | --- | --- |
| 1 | mainly 1-2 family homes | 0 |
| 2 | mainly 3-5 family homes | 0 |
| 3 | mainly 6-10 family homes | 0 |
| 4 | mainly 10+ family homes | 0 |
| 5 | mainly business buildings | 1 |
In [65]:
biz_code_map = {
    1: 0,
    2: 0,
    3: 0,
    4: 0,
    5: 1
}

gen_pop_lowMissing['PLZ8: Primarily Business Bldgs'] = \
    data_mapper(gen_pop_lowMissing['PLZ8: Most Common Bldg Type'],
                              biz_code_map)
There are 0 null values in the series     PLZ8: Most Common Bldg Type prior to extraction
There are 0 null values in the series     PLZ8: Most Common Bldg Type after extraction
In [66]:
# Drop original feature, it's no longer useful
gen_pop_lowMissing.drop(columns=['PLZ8: Most Common Bldg Type'],
                         inplace=True)

Discussion 1.2.2: Engineer Mixed-Type Features

Summarizing all of my earlier notes on the mixed-type features:

  1. Generation Designation (PRAEGENDE_JUGENDJAHRE) was split into a feature focused on the decade of that person's generation and the movement they were associated with from that decade (avantgarde or mainstream)
  2. For RR4: Life Stage Type - Int'l Code Mapping (CAMEO_INTL_2015), I split the feature into two new features that extracted information from the original feature about a person's wealth level and their current life stage.
  3. As they didn't seem to have any meaningful ordinal/interval/numeric data that I needed to salvage, I decided to treat Life Stage - HighRes and Life Stage - LowRes as categorical variables and made dummies from them
  4. With Bldg: Neighborhood Quality, I assumed, based on its levels, that any codes not mentioning rural buildings reflected areas that were not primarily rural. As such, I extracted a new multi-level categorical feature indicating whether a person's building is in a rural neighborhood, is a new building in a rural neighborhood, or is not in a rural neighborhood at all, then made dummies out of it. In the original feature, I recoded the rural levels to the 'no score calculated' level, keeping the neighborhood quality score itself (which only ever existed for non-rural neighborhoods) intact, since it is of interval type (or at least ordinal).
  5. For our final mixed-type feature, PLZ8: Most Common Bldg Type, I investigated the feature's allowed levels and determined that it was mostly redundant with the other PLZ8 features we already have. However, it does contain one piece of information I didn't see reflected elsewhere: how prominent business buildings are in the PLZ8 region. So I extracted that information into a binary feature of its own and dropped the rest.

Step 1.2.3: Complete Feature Selection

In order to finish this step up, you need to make sure that your data frame now only has the columns that you want to keep. To summarize, the dataframe should consist of the following:

  • All numeric, interval, and ordinal type columns from the original dataset.
  • Binary categorical features (all numerically-encoded).
  • Engineered features from other multi-level categorical features and mixed features.

Make sure that for any new columns you have engineered, you've excluded the original columns from the final dataset. Otherwise, their values will interfere with the analysis later in the project. For example, you should not keep "PRAEGENDE_JUGENDJAHRE", since its values won't be useful for the algorithm: only the engineered features derived from it should be retained. As a reminder, your data should only be from the subset with few or no missing values.
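
One way to make this check less manual is a small helper along these lines (a sketch; the function name and toy frame are made up for illustration, not part of the project template):

```python
import pandas as pd

def check_ready_for_modeling(df, engineered_sources):
    '''Return columns that shouldn't be passed to the algorithms:
    non-numeric columns, and source columns we already engineered from.'''
    leftover_objects = df.select_dtypes(include='object').columns.tolist()
    leftover_sources = [c for c in engineered_sources if c in df.columns]
    return leftover_objects, leftover_sources

# Toy frame with one numeric column and a source column we forgot to drop
toy = pd.DataFrame({'Age Bin': [1, 2], 'Generation Designation': [5, 9]})
print(check_ready_for_modeling(toy, ['Generation Designation']))
# ([], ['Generation Designation'])
```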

In [67]:
# Do whatever you need to in order to ensure that the dataframe only contains
# the columns that should be passed to the algorithm functions.

gen_pop_lowMissing.columns.values
Out[67]:
array(['Age Bin', 'Gender', 'MoneyType__Minimalist', 'MoneyType__Saver',
       'MoneyType__Preparer', 'MoneyType__Investor',
       'MoneyType__Inconspicuous', 'MoneyType__Homeowner', 'Birth Year',
       'Young Environmentalist', 'Health Type', 'Customer Type',
       'PersonalityType__Social', 'PersonalityType__Family',
       'PersonalityType__Religious', 'PersonalityType__Materialist',
       'PersonalityType__Dreamer', 'PersonalityType__Sensual',
       'PersonalityType__Event', 'PersonalityType__Cultured',
       'PersonalityType__Rational', 'PersonalityType__Critical',
       'PersonalityType__Dominant', 'PersonalityType__Combative',
       'PersonalityType__Dutiful', 'PersonalityType__Traditional',
       'HH: Number of Adults in HH',
       'HH: Number of Academic Title Holders in HH',
       'HH: Net Income Bins', 'HH: Probability of Children in Residence',
       'HH: Length of Residency (bins)', 'Bldg: Number of HHs',
       'Bldg: Number of Academic Title Holders',
       'Bldg: Distance to Point of Sale Category',
       "Bldg: Year of Building's Initial Database Entry",
       'Bldg: Neighborhood Quality', 'RR3: Bins of 1-2 Family Homes',
       'RR3: Bins of 3-5 Family Homes', 'RR3: Bins of 6-10 Family Homes',
       'RR3: Bins of 10+ Family Homes', 'RR3: Bin Counts of Bldgs',
       'PostCode: Distance to Nearest Urban Center (bins)',
       'PostCode: Density of HHs per km^2 (bins)',
       'PostCode: Distance to City Center (bins)',
       'RR1: Residential-Commercial Activity Ratio (categories)',
       'RR1: Purchasing Power (bins)', 'RR1: Movement Patterns',
       'RR1: Online Affinity', 'RR1: Neighborhood Type',
       'PLZ8: Number of Cars', 'PLZ8: Bins of 1-2 Family Homes',
       'PLZ8: Bins of 3-5 Family Homes',
       'PLZ8: Bins of 6-10 Family Homes',
       'PLZ8: Bins of 10+ Family Homes', 'PLZ8: Bin Counts of HHs',
       'PLZ8: Bin Counts of Bldgs', 'Community: Share of Unemployment',
       'Community: Size (bins)',
       'Community: Share of Unemployment Relative to Parent County',
       'Bldg: Location Relative to E or W Germany__W',
       'Bldg: Location Relative to E or W Germany__nan',
       'Small or Home Office Owner__1.0',
       'Small or Home Office Owner__nan', 'Insurance Type__2.0',
       'Insurance Type__nan', 'Nationality Based on Name__2.0',
       'Nationality Based on Name__3.0', 'Nationality Based on Name__nan',
       'Shopper Type__1.0', 'Shopper Type__2.0', 'Shopper Type__3.0',
       'Shopper Type__nan', 'Socioeconomic Status - LowRes__2.0',
       'Socioeconomic Status - LowRes__3.0',
       'Socioeconomic Status - LowRes__4.0',
       'Socioeconomic Status - LowRes__5.0',
       'Socioeconomic Status - LowRes__nan', 'Family Type - LowRes__2.0',
       'Family Type - LowRes__3.0', 'Family Type - LowRes__4.0',
       'Family Type - LowRes__5.0', 'Family Type - LowRes__nan',
       'MoneyType__Primary__2.0', 'MoneyType__Primary__3.0',
       'MoneyType__Primary__4.0', 'MoneyType__Primary__5.0',
       'MoneyType__Primary__6.0', 'MoneyType__Primary__nan',
       'Energy Consumption Type__2.0', 'Energy Consumption Type__3.0',
       'Energy Consumption Type__4.0', 'Energy Consumption Type__5.0',
       'Energy Consumption Type__6.0', 'Energy Consumption Type__nan',
       'Consumption Channel Type__2.0', 'Consumption Channel Type__3.0',
       'Consumption Channel Type__4.0', 'Consumption Channel Type__5.0',
       'Consumption Channel Type__6.0', 'Consumption Channel Type__nan',
       'Bldg: Building Type__2.0', 'Bldg: Building Type__3.0',
       'Bldg: Building Type__4.0', 'Bldg: Building Type__5.0',
       'Bldg: Building Type__6.0', 'Bldg: Building Type__8.0',
       'Bldg: Building Type__nan', 'RR4: Life Stage Type - LowRes__2',
       'RR4: Life Stage Type - LowRes__3',
       'RR4: Life Stage Type - LowRes__4',
       'RR4: Life Stage Type - LowRes__5',
       'RR4: Life Stage Type - LowRes__6',
       'RR4: Life Stage Type - LowRes__7',
       'RR4: Life Stage Type - LowRes__8',
       'RR4: Life Stage Type - LowRes__9',
       'RR4: Life Stage Type - LowRes__nan',
       'Socioeconomic Status - HighRes__2.0',
       'Socioeconomic Status - HighRes__3.0',
       'Socioeconomic Status - HighRes__4.0',
       'Socioeconomic Status - HighRes__5.0',
       'Socioeconomic Status - HighRes__6.0',
       'Socioeconomic Status - HighRes__7.0',
       'Socioeconomic Status - HighRes__8.0',
       'Socioeconomic Status - HighRes__9.0',
       'Socioeconomic Status - HighRes__10.0',
       'Socioeconomic Status - HighRes__nan',
       'Family Type - HighRes__2.0', 'Family Type - HighRes__3.0',
       'Family Type - HighRes__4.0', 'Family Type - HighRes__5.0',
       'Family Type - HighRes__6.0', 'Family Type - HighRes__7.0',
       'Family Type - HighRes__8.0', 'Family Type - HighRes__9.0',
       'Family Type - HighRes__10.0', 'Family Type - HighRes__11.0',
       'Family Type - HighRes__nan', 'Vacation Habits__2.0',
       'Vacation Habits__3.0', 'Vacation Habits__4.0',
       'Vacation Habits__5.0', 'Vacation Habits__6.0',
       'Vacation Habits__7.0', 'Vacation Habits__8.0',
       'Vacation Habits__9.0', 'Vacation Habits__10.0',
       'Vacation Habits__11.0', 'Vacation Habits__12.0',
       'Vacation Habits__nan', 'RR4: Life Stage Type - HighRes__1B',
       'RR4: Life Stage Type - HighRes__1C',
       'RR4: Life Stage Type - HighRes__1D',
       'RR4: Life Stage Type - HighRes__1E',
       'RR4: Life Stage Type - HighRes__2A',
       'RR4: Life Stage Type - HighRes__2B',
       'RR4: Life Stage Type - HighRes__2C',
       'RR4: Life Stage Type - HighRes__2D',
       'RR4: Life Stage Type - HighRes__3A',
       'RR4: Life Stage Type - HighRes__3B',
       'RR4: Life Stage Type - HighRes__3C',
       'RR4: Life Stage Type - HighRes__3D',
       'RR4: Life Stage Type - HighRes__4A',
       'RR4: Life Stage Type - HighRes__4B',
       'RR4: Life Stage Type - HighRes__4C',
       'RR4: Life Stage Type - HighRes__4D',
       'RR4: Life Stage Type - HighRes__4E',
       'RR4: Life Stage Type - HighRes__5A',
       'RR4: Life Stage Type - HighRes__5B',
       'RR4: Life Stage Type - HighRes__5C',
       'RR4: Life Stage Type - HighRes__5D',
       'RR4: Life Stage Type - HighRes__5E',
       'RR4: Life Stage Type - HighRes__5F',
       'RR4: Life Stage Type - HighRes__6A',
       'RR4: Life Stage Type - HighRes__6B',
       'RR4: Life Stage Type - HighRes__6C',
       'RR4: Life Stage Type - HighRes__6D',
       'RR4: Life Stage Type - HighRes__6E',
       'RR4: Life Stage Type - HighRes__6F',
       'RR4: Life Stage Type - HighRes__7A',
       'RR4: Life Stage Type - HighRes__7B',
       'RR4: Life Stage Type - HighRes__7C',
       'RR4: Life Stage Type - HighRes__7D',
       'RR4: Life Stage Type - HighRes__7E',
       'RR4: Life Stage Type - HighRes__8A',
       'RR4: Life Stage Type - HighRes__8B',
       'RR4: Life Stage Type - HighRes__8C',
       'RR4: Life Stage Type - HighRes__8D',
       'RR4: Life Stage Type - HighRes__9A',
       'RR4: Life Stage Type - HighRes__9B',
       'RR4: Life Stage Type - HighRes__9C',
       'RR4: Life Stage Type - HighRes__9D',
       'RR4: Life Stage Type - HighRes__9E',
       'RR4: Life Stage Type - HighRes__nan', 'Generation Decade',
       'Generation Movement', "RR4: Life Stage Type - Int'l - Wealth",
       "RR4: Life Stage Type - Int'l - Stage",
       'Life Stage - HighRes__2.0', 'Life Stage - HighRes__3.0',
       'Life Stage - HighRes__4.0', 'Life Stage - HighRes__5.0',
       'Life Stage - HighRes__6.0', 'Life Stage - HighRes__7.0',
       'Life Stage - HighRes__8.0', 'Life Stage - HighRes__9.0',
       'Life Stage - HighRes__10.0', 'Life Stage - HighRes__11.0',
       'Life Stage - HighRes__12.0', 'Life Stage - HighRes__13.0',
       'Life Stage - HighRes__14.0', 'Life Stage - HighRes__15.0',
       'Life Stage - HighRes__16.0', 'Life Stage - HighRes__17.0',
       'Life Stage - HighRes__18.0', 'Life Stage - HighRes__19.0',
       'Life Stage - HighRes__20.0', 'Life Stage - HighRes__21.0',
       'Life Stage - HighRes__22.0', 'Life Stage - HighRes__23.0',
       'Life Stage - HighRes__24.0', 'Life Stage - HighRes__25.0',
       'Life Stage - HighRes__26.0', 'Life Stage - HighRes__27.0',
       'Life Stage - HighRes__28.0', 'Life Stage - HighRes__29.0',
       'Life Stage - HighRes__30.0', 'Life Stage - HighRes__31.0',
       'Life Stage - HighRes__32.0', 'Life Stage - HighRes__33.0',
       'Life Stage - HighRes__34.0', 'Life Stage - HighRes__35.0',
       'Life Stage - HighRes__36.0', 'Life Stage - HighRes__37.0',
       'Life Stage - HighRes__38.0', 'Life Stage - HighRes__39.0',
       'Life Stage - HighRes__40.0', 'Life Stage - HighRes__nan',
       'Life Stage - LowRes__2.0', 'Life Stage - LowRes__3.0',
       'Life Stage - LowRes__4.0', 'Life Stage - LowRes__5.0',
       'Life Stage - LowRes__6.0', 'Life Stage - LowRes__7.0',
       'Life Stage - LowRes__8.0', 'Life Stage - LowRes__9.0',
       'Life Stage - LowRes__10.0', 'Life Stage - LowRes__11.0',
       'Life Stage - LowRes__12.0', 'Life Stage - LowRes__nan',
       'Bldg: Rural Type__1.0', 'Bldg: Rural Type__2.0',
       'Bldg: Rural Type__nan', 'PLZ8: Primarily Business Bldgs'],
      dtype=object)

The primary thing I was worried about was remembering to drop features that we had extracted information out of. I don't want a bunch of redundant information lying about during modeling. Looking through the column names, it appears I've managed to avoid this, so I'm satisfied!

Step 1.3: Create a Cleaning Function

Even though you've finished cleaning up the general population demographics data, it's important to look ahead and realize that you'll need to perform the same cleaning steps on the customer demographics data. In this substep, complete the function below to execute the main feature selection, encoding, and re-engineering steps you performed above. Then, when it comes to looking at the customer data in Step 3, you can just run this function on that DataFrame to get the trimmed dataset in a single step.

In [71]:
def clean_data(df, feature_summary):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data
    
    INPUTS
    ------
    df: Demographics DataFrame
    
    feature_summary: DataFrame that includes listing of the various missing value codes
        for each feature
        
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """
    
    # -------------------------------------------------------------------
    # convert missing value codes into NaNs, ...

    # Convert the strings for the missing values from the feature summary
    # To be proper lists of values to use for filling in NaNs

    # First, remove brackets
    # Then, split on comma separator
    feature_summary.loc[:, 'missing_or_unknown'] = \
        feature_summary.loc[:, 'missing_or_unknown'].str[1:-1].str.split(',')



    def fill_missing(df, missing_codes_mapping, inplace=False):
        '''
        Parses dataframe of missing values and their mapping to individual feature names
        and then fills any of those values found in a dataframe's matching feature columns
        with np.nan.

        Inputs
        ------
        df: pandas DataFrame. Table with features that match the ones for which we have
            missing mappings. Each sample is a person.

        missing_codes_mapping: pandas DataFrame. Contains columns 'attribute' and 
            'missing_or_unknown' that map codes used for missing/unknown values to 
            features/attributes. 'missing_or_unknown' is expected to have elements
            that are lists of str (usually ints, but sometimes chars or empty lists).

        Returns
        -------
        df with NaN values filled in according to missing_codes_mapping
        '''

        # Use deep copy if inplace = False, otherwise use actual inputs
        if inplace:
            data = df
            missing_codes = missing_codes_mapping
        else:            
            data = df.copy(deep=True)
            missing_codes = missing_codes_mapping.copy(deep=True)

        def parse_missing_codes(code_list):
            '''
            Goes through a list of str and converts the elements of the list according to the needs 
            of the dtypes in our demographic data such that the results can be used for 
            filling in NaN values.

            Inputs
            ------
            code_list: list of str. List is expected to contain the chars, floats, or ints 
                that are codes indicating a missing or unknown value.

            Returns
            -------
            list or np.nan. Each element of the list returned is typecast according to 
                the expected needs of the NaN-filling it will be doing. Empty lists
                (or lists with only an empty string in them) are returned as np.nan.
            '''

            # Make sure list isn't just empty string
            if '' not in code_list:
                # Check if list can be converted to int without issues - if so, do it
                try:
                    return [int(e) for e in code_list]

                # Not all are cast-able to int
                except ValueError:
                    return [float(e) if 'X' not in e else e for e in code_list]

            else:
                return np.nan

        # Typecast missing value codes appropriately
        missing_codes.loc[:, 'missing_or_unknown'] = \
            missing_codes.loc[:, 'missing_or_unknown'].apply(parse_missing_codes)

        # Create series that maps feature names (index) to missing codes (data)
        code_map = pd.Series(data=missing_codes['missing_or_unknown'].values,
                             index=missing_codes['attribute'].values)

        # When passing a Series into to_replace, index is key and data is value (like a dict)
        data.replace(to_replace=code_map,
                     value=np.nan,
                     inplace=True)

        return data
    

    df = fill_missing(df, feature_summary)

    # ------------------------------------------------------------------    
    # remove selected columns and rows, ...

    # Removing outlier features, except the one that provides a birth year
    df.drop(columns=['ALTER_HH', 'KBA05_BAUMAX', 'KK_KUNDENTYP',
        'AGER_TYP', 'TITEL_KZ'], inplace = True)

    # If I'm going to make sense out of these, I need more helpful names

    # Read in the names mapping CSV as a dict
    new_names = pd.read_csv('col_renaming.csv', header=None, index_col=0,
                            squeeze=True).to_dict()

    df.rename(columns=new_names, inplace=True)

    # Remove rows having more than one missing value
    df = df.loc[df.isnull().sum(axis=1) < 2,:]

    
    # -------------------------------------------------------------------
    # select, re-encode, and engineer column values.

    # Re-encode categorical variable(s) to be kept in the analysis.
    cat_cols = ['Bldg: Location Relative to E or W Germany',
               'Small or Home Office Owner', 'Insurance Type',
               'Nationality Based on Name', 'Shopper Type',
                'Socioeconomic Status - LowRes', 'Family Type - LowRes',
                'MoneyType__Primary', 'Energy Consumption Type',
                'Consumption Channel Type', 'Bldg: Building Type',
                'RR4: Life Stage Type - LowRes', 'Socioeconomic Status - HighRes',
                'Family Type - HighRes', 'Vacation Habits', 
                'RR4: Life Stage Type - HighRes']

    # Drop the first dummy to avoid the dummy variable trap
    # Include a NaN dummy so a zero-vector can't be interpreted ambiguously
        # as either NaN or the dropped first dummy (which are likely different)
    df = pd.get_dummies(
        df, prefix_sep='__', 
        drop_first=True, dummy_na=True,
        columns=cat_cols)



    def data_mapper(series, mapping_dict):
        '''
        Reads in a pandas Series object that represents the Generation Designation feature
        and returns a re-encoded series according to mapping_dict.

        Inputs
        ------
        series: pandas Series of integer codes (1 through 15) representing different
            Generation Designation values

        mapping_dict: dict of form {designation_code: new_code} used to determine what
            values to return


        Returns
        -------
        pandas Series with the new codes
        '''
        
        # Since NaN values aren't always propagated as expected, do a quick check
        print(f"There are {series.isnull().sum()} null values in the series \
        {series.name} prior to extraction")
        
        out = series.map(mapping_dict, na_action = 'ignore')
        
        print(f"There are {out.isnull().sum()} null values in the series \
        {series.name} after extraction")
        
        return out              


    # For extracting decade of birth info from Generation Designation feature
    decade_code_map = {
    1: 0,
    2: 0,
    3: 1,
    4: 1,
    5: 2,
    6: 2,
    7: 2,
    8: 3,
    9: 3,
    10: 4,
    11: 4,
    12: 4,
    13: 4,
    14: 5,
    15: 5
    }

    df['Generation Decade'] = \
    data_mapper(df['Generation Designation'],
        decade_code_map)

    # For extracting generational movement from Generation Designation feature
    movement_code_map = {
    1: 0,
    2: 1,
    3: 0,
    4: 1,
    5: 0,
    6: 1,
    7: 1,
    8: 0,
    9: 1,
    10: 0,
    11: 1,
    12: 0,
    13: 1,
    14: 0,
    15: 1
    }

    df['Generation Movement'] = \
    data_mapper(df['Generation Designation'],
        movement_code_map)

    # Drop Generation Designation feature now that we've captured
    # its information in the new features
    df.drop(columns='Generation Designation', inplace=True)


    # Cast as int since it seems to keep coming up as object type
    df["RR4: Life Stage Type - Int'l Code Mapping"] =\
    df["RR4: Life Stage Type - Int'l Code Mapping"].astype(int)

    # Extract the tens digit as an int - wealth code
    df["RR4: Life Stage Type - Int'l - Wealth"] = \
    (df["RR4: Life Stage Type - Int'l Code Mapping"] \
        / 10).astype(int)

    # Extract the ones digit - life stage code
    df["RR4: Life Stage Type - Int'l - Stage"] = \
    df["RR4: Life Stage Type - Int'l Code Mapping"] % 10

    # Drop the lingering RR4: Life Stage Type - Int'l Code Mapping feature
    df.drop(
        columns="RR4: Life Stage Type - Int'l Code Mapping", inplace=True)

    # Make dummies for Life Stage - HighRes, and Life Stage - LowRes
    cat_cols = ['Life Stage - HighRes',
               'Life Stage - LowRes']

    df = pd.get_dummies(
        df, prefix_sep='__', 
        drop_first=True, dummy_na=True,
        columns=cat_cols)

    # For extracting rural neighborhood status from Neighborhood Quality
    rural_code_map = {
    0: np.nan,
    1: 0,
    2: 0,
    3: 0,
    4: 0,
    5: 0,
    7: 1,
    8: 2
    }

    df['Bldg: Rural Type'] = data_mapper(df['Bldg: Neighborhood Quality'],
        rural_code_map)

    # Make dummies out of the multi-level Rural Type categorical
    cat_cols = ['Bldg: Rural Type']

    df = pd.get_dummies(df,
        prefix_sep='__',
        drop_first=True,
        dummy_na=True,
        columns=cat_cols)

    # Exclude rural categories as though they weren't scored
    neighborhood_code_map = {
        0: 0,
        1: 1,
        2: 2,
        3: 3,
        4: 4,
        5: 5,
        7: 0,
        8: 0
        }

    df['Bldg: Neighborhood Quality'] = \
    data_mapper(df['Bldg: Neighborhood Quality'],
        neighborhood_code_map)

    # For extracting business building dominance and dropping info about rest
    biz_code_map = {
    1: 0,
    2: 0,
    3: 0,
    4: 0,
    5: 1
    }
    
    df['PLZ8: Primarily Business Bldgs'] = \
        data_mapper(df['PLZ8: Most Common Bldg Type'],
            biz_code_map)

    # Drop original feature, it's no longer useful
    df.drop(columns=['PLZ8: Most Common Bldg Type'],
                             inplace=True)


    # -------------------------------------------------------------------
    # Return the cleaned dataframe.
    
    return df
In [72]:
# Test out our new cleaning function

# Load in the general demographics data.
gen_pop = pd.read_csv('data/Udacity_AZDIAS_Subset.csv', sep=';')

# Load in the feature summary file and the new column name mapping
feat_info = pd.read_csv('data/AZDIAS_Feature_Summary.csv', sep=';')
feat_info['attribute_renamed'] = pd.read_csv('col_renaming.csv', header=None)[1]


gen_pop = clean_data(gen_pop, feat_info)
gen_pop.info()
There are 0 null values in the series         Generation Designation prior to extraction
There are 0 null values in the series         Generation Designation after extraction
There are 0 null values in the series         Generation Designation prior to extraction
There are 0 null values in the series         Generation Designation after extraction
There are 0 null values in the series         Bldg: Neighborhood Quality prior to extraction
There are 4 null values in the series         Bldg: Neighborhood Quality after extraction
There are 0 null values in the series         Bldg: Neighborhood Quality prior to extraction
There are 0 null values in the series         Bldg: Neighborhood Quality after extraction
There are 0 null values in the series         PLZ8: Most Common Bldg Type prior to extraction
There are 0 null values in the series         PLZ8: Most Common Bldg Type after extraction
<class 'pandas.core.frame.DataFrame'>
Int64Index: 629236 entries, 1 to 891220
Columns: 253 entries, Age Bin to PLZ8: Primarily Business Bldgs
dtypes: float64(36), int64(28), uint8(189)
memory usage: 425.5 MB
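As a quick aside on the dummy-encoding convention used inside `clean_data` (`drop_first=True` together with `dummy_na=True`), here's a toy illustration of why both flags matter; the `color` column and its values are hypothetical, not part of the demographics data:

```python
import numpy as np
import pandas as pd

# Toy categorical feature with one missing value
toy = pd.DataFrame({'color': [1.0, 2.0, 3.0, np.nan]})

# drop_first avoids the dummy variable trap (perfect multicollinearity);
# dummy_na gives NaN its own column, so an all-zeros row unambiguously
# means "the dropped first level" rather than "missing OR first level"
dummies = pd.get_dummies(toy, prefix_sep='__', drop_first=True,
                         dummy_na=True, columns=['color'])

print(list(dummies.columns))  # ['color__2.0', 'color__3.0', 'color__nan']
```

The row for level 1.0 comes out as all zeros, while the missing-value row gets a 1 only in `color__nan`, so the two cases stay distinguishable.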

Step 2: Feature Transformation

Step 2.1: Apply Feature Scaling

Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale for features. Starting from this part of the project, you'll want to keep an eye on the API reference page for sklearn to help you navigate to all of the classes and functions that you'll need. In this substep, you'll need to check the following:

  • sklearn requires that data not have missing values in order for its estimators to work properly. So, before applying the scaler to your data, make sure that you've cleaned the DataFrame of the remaining missing values. This can be as simple as just removing all data points with missing data, or applying an Imputer to replace all missing values. You might also try a more complicated procedure where you temporarily remove missing values in order to compute the scaling parameters before re-introducing those missing values and applying imputation. Think about how much missing data you have and what possible effects each approach might have on your analysis, and justify your decision in the discussion section below.
  • For the actual scaling function, a StandardScaler instance is suggested, scaling each feature to mean 0 and standard deviation 1.
  • For these classes, you can make use of the .fit_transform() method to both fit a procedure to the data as well as apply the transformation to the data at the same time. Don't forget to keep the fit sklearn objects handy, since you'll be applying them to the customer demographics data towards the end of the project.

Recalling what we saw towards the end of the preprocessing work, we know that about 60% of our rows have no missing values, with the remaining 40% having at least one. Additionally, here's a reminder of our current situation with regard to missing values at the feature level:
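As a minimal sketch of what the scaling step will eventually look like once the NaNs are dealt with (the array here is a toy stand-in, not the real feature matrix):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned, imputation-complete feature matrix
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Keep the fitted scaler object around: the same instance must later be
# applied to the customer demographics data via .transform(), so both
# datasets are scaled with the general-population parameters
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```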

In [73]:
missing = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(gen_pop),2)
missing.sort_values('total missing', ascending = False, inplace = True)
missing
Out[73]:
total missing percent missing
Birth Year 235839 0.37
HH: Probability of Children in Residence 5441 0.01
Bldg: Number of HHs 286 0.00
PLZ8: Number of Cars 275 0.00
Bldg: Distance to Point of Sale Category 25 0.00
RR4: Life Stage Type - HighRes__4D 0 0.00
RR4: Life Stage Type - HighRes__4C 0 0.00
RR4: Life Stage Type - HighRes__4E 0 0.00
RR4: Life Stage Type - HighRes__4B 0 0.00
RR4: Life Stage Type - HighRes__6C 0 0.00
RR4: Life Stage Type - HighRes__5B 0 0.00
RR4: Life Stage Type - HighRes__5C 0 0.00
RR4: Life Stage Type - HighRes__5D 0 0.00
RR4: Life Stage Type - HighRes__5E 0 0.00
RR4: Life Stage Type - HighRes__5F 0 0.00
RR4: Life Stage Type - HighRes__6A 0 0.00
RR4: Life Stage Type - HighRes__6B 0 0.00
RR4: Life Stage Type - HighRes__5A 0 0.00
RR4: Life Stage Type - HighRes__6D 0 0.00
RR4: Life Stage Type - HighRes__3D 0 0.00
RR4: Life Stage Type - HighRes__6E 0 0.00
RR4: Life Stage Type - HighRes__6F 0 0.00
RR4: Life Stage Type - HighRes__7A 0 0.00
RR4: Life Stage Type - HighRes__7B 0 0.00
RR4: Life Stage Type - HighRes__7C 0 0.00
RR4: Life Stage Type - HighRes__7D 0 0.00
RR4: Life Stage Type - HighRes__7E 0 0.00
RR4: Life Stage Type - HighRes__8A 0 0.00
RR4: Life Stage Type - HighRes__8B 0 0.00
RR4: Life Stage Type - HighRes__8C 0 0.00
RR4: Life Stage Type - HighRes__8D 0 0.00
RR4: Life Stage Type - HighRes__9A 0 0.00
RR4: Life Stage Type - HighRes__9B 0 0.00
RR4: Life Stage Type - HighRes__9C 0 0.00
RR4: Life Stage Type - HighRes__4A 0 0.00
Age Bin 0 0.00
RR4: Life Stage Type - HighRes__3C 0 0.00
Vacation Habits__9.0 0 0.00
Family Type - HighRes__7.0 0 0.00
Family Type - HighRes__8.0 0 0.00
Family Type - HighRes__9.0 0 0.00
Family Type - HighRes__10.0 0 0.00
... ... ...
Socioeconomic Status - HighRes__3.0 0 0.00
Socioeconomic Status - HighRes__4.0 0 0.00
Socioeconomic Status - HighRes__5.0 0 0.00
Socioeconomic Status - HighRes__6.0 0 0.00
Socioeconomic Status - HighRes__7.0 0 0.00
Socioeconomic Status - HighRes__8.0 0 0.00
Socioeconomic Status - HighRes__9.0 0 0.00
Socioeconomic Status - HighRes__10.0 0 0.00
Socioeconomic Status - HighRes__nan 0 0.00
Gender 0 0.00
Consumption Channel Type__nan 0 0.00
Consumption Channel Type__5.0 0 0.00
Shopper Type__1.0 0 0.00
Consumption Channel Type__4.0 0 0.00
Shopper Type__2.0 0 0.00
Shopper Type__3.0 0 0.00
Shopper Type__nan 0 0.00
Socioeconomic Status - LowRes__2.0 0 0.00
Socioeconomic Status - LowRes__3.0 0 0.00
Socioeconomic Status - LowRes__4.0 0 0.00
Socioeconomic Status - LowRes__5.0 0 0.00
Socioeconomic Status - LowRes__nan 0 0.00
Family Type - LowRes__2.0 0 0.00
Family Type - LowRes__3.0 0 0.00
Family Type - LowRes__4.0 0 0.00
Family Type - LowRes__5.0 0 0.00
Family Type - LowRes__nan 0 0.00
MoneyType__Primary__2.0 0 0.00
MoneyType__Primary__3.0 0 0.00
MoneyType__Primary__4.0 0 0.00
MoneyType__Primary__5.0 0 0.00
MoneyType__Primary__6.0 0 0.00
MoneyType__Primary__nan 0 0.00
Energy Consumption Type__2.0 0 0.00
Energy Consumption Type__3.0 0 0.00
Energy Consumption Type__4.0 0 0.00
Energy Consumption Type__5.0 0 0.00
Energy Consumption Type__6.0 0 0.00
Energy Consumption Type__nan 0 0.00
Consumption Channel Type__2.0 0 0.00
Consumption Channel Type__3.0 0 0.00
PLZ8: Primarily Business Bldgs 0 0.00

253 rows × 2 columns

Obviously, the feature that would require the most imputation is Birth Year. While I'm not thrilled about the fact that I'd be imputing this one substantially, my earlier logic on why it needs to be done (it's potentially a very useful age proxy variable, as Age Bin is very low resolution with 15-year bins) still holds.

  • All of this being said, if I find questionable results in the clustering produced later, I may opt to drop the Birth Year feature after all.

Let's take a look at the distributions of non-missing values for each of these features to get an idea of what we're working with here.

In [74]:
sns.distplot(gen_pop['Birth Year'].dropna())
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1cc273c8>
In [75]:
gen_pop['Birth Year'].describe()
Out[75]:
count    393397.000000
mean       1966.776254
std          17.713269
min        1900.000000
25%        1954.000000
50%        1967.000000
75%        1980.000000
max        2017.000000
Name: Birth Year, dtype: float64
In [76]:
code_map = {
    1: '< 30 years old',
    2: '30 - 45 years old',
    3: '46 - 60 years old',
    4: '> 60 years old'
}

sns.distplot(gen_pop['Age Bin'].dropna())
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x1aa5a8e1d0>

It looks like the majority Age Bin value (46 - 60 years old) corresponds roughly to the median Birth Year value of 1967 (given that the final year in the Birth Year feature is 2017, the midpoint of this bin implies a birth year of about 1964). Additionally, the Birth Year feature is left-skewed, which corresponds reasonably well with Age Bin having its second peak for the oldest group (code 4, 60+ year olds). Using Age Bin to impute Birth Year seems like a pretty reasonable plan!

In [77]:
code_map = {
    1: 'most likely',
    2: 'very likely',
    3: 'likely',
    4: 'average',
    5: 'unlikely',
    6: 'very unlikely'
}

sns.distplot(gen_pop['HH: Probability of Children in Residence'].dropna())
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1c62b860>

As I can see this feature being correlated with at least a few other features (e.g. the Family Type features and other region-level, family-oriented features), I hesitate to use a simple univariate imputation method like the mode. Also, this feature is missing relatively few values (under 1% of the dataset). Thus, I am going to simply drop the rows in which it is missing.

In [78]:
sns.distplot(gen_pop['Bldg: Number of HHs'].dropna())
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1c708198>
In [79]:
gen_pop['Bldg: Number of HHs'].describe()
Out[79]:
count    628950.000000
mean          8.457550
std          15.431532
min           1.000000
25%           2.000000
50%           4.000000
75%          10.000000
max         536.000000
Name: Bldg: Number of HHs, dtype: float64

This one feels like it's on the cusp for me, decision-wise. I see a few regional-level features that might correlate with this one, so I'm hesitant to do a quick univariate imputation; but at the same time, the amount of missing data is so low that I'm not too worried about improperly skewing the sample relative to the underlying population distribution. So for this one I'll err on the side of not losing more data than I have to, and impute using the median (the mean would be inappropriate, given that the distribution is so skewed).

In [80]:
sns.distplot(gen_pop['PLZ8: Number of Cars'].dropna())
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1bf64ef0>
In [81]:
gen_pop['PLZ8: Number of Cars'].describe()
Out[81]:
count    628961.000000
mean        611.853301
std         332.342837
min           1.000000
25%         382.000000
50%         543.000000
75%         766.000000
max        2300.000000
Name: PLZ8: Number of Cars, dtype: float64

Much like the previous feature, I feel like this one has so few missing values that a simple imputation approach wouldn't impact the overall distribution too strongly. And, as a bonus, this one doesn't seem to have any intuitively related features elsewhere, strengthening that argument. So we'll plan to also impute this one using the median, given its amount of skew.

In [82]:
code_map = {
    1: 'building is located in a 125 x 125m grid cell (RA1), which is a consumption cell',
    2: 'building is located in a 250 x 250m grid cell that includes at least one RA1-consumption cell',
    3: 'building is located in a 500 x 500m grid cell that includes at least one RA1-consumption cell',
    4: 'building is located in a 1 x 1km grid cell that includes at least one RA1-consumption cell',
    5: 'building is located in a 2 x 2km grid cell that includes at least one RA1-consumption cell',
    6: 'building is located in a 10 x 10km grid cell that includes at least one RA1-consumption cell',
    7: 'building is not located in a 10 x 10km range of a consumption cell'
}

sns.distplot(gen_pop['Bldg: Distance to Point of Sale Category'].dropna())
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1a1312e8>

This seems to effectively be a distance metric that can be used as a proxy for how urban or business-adjacent a building is. I see no reason to treat this one any differently than those that came before it. Since the feature is ordinal, the mean (and arguably the median) would be inappropriate, so let's go ahead and impute using the mode value of 1.

My Imputation Plan

| Feature Name | % of Rows Missing Values | Imputation Approach |
| --- | --- | --- |
| Birth Year | 37% | Since Age Bin isn't missing any values, use the midpoint year of the row's Age Bin as the Birth Year |
| HH: Probability of Children in Residence | 1% | Drop the rows wherein this is missing |
| Bldg: Number of HHs | < 1% | Impute using the median |
| PLZ8: Number of Cars | < 1% | Impute using the median |
| Bldg: Distance to Point of Sale Category | < 1% | Impute using the mode |
In [83]:
# If you've not yet cleaned the dataset of all NaN values, then investigate and
# do that now.

# Use Age Bin to impute Birth Year values, using midpoint of associated Age Bin to dictate year
age_bin_map = {
    1: '< 30 years old',
    2: '30 - 45 years old',
    3: '46 - 60 years old',
    4: '> 60 years old'
}

# Assume ages are all relative to 2017, as this is the maximum Birth Year value
# Earliest Birth Year is 1900, making the greatest age 117
    # Treat anyone 100+ as an outlier, so code 4 gets age 80 (midpoint of 60-100), i.e. 1937
age_bin_to_year = {
    1: 2002,
    2: 1979,
    3: 1963,
    4: 1937
}

# Make a series that is a mapping of Age Bin to birth years
mapped_series = gen_pop['Age Bin'].map(age_bin_to_year)

# Use this mapped series and fillna() on Birth Year 
    # to take midpoint Age Bin year as Birth Year
gen_pop['Birth Year'].fillna(mapped_series, inplace=True)
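The `fillna` pattern above works because pandas aligns the filler Series on index, filling only the positions where the target is NaN. A toy sketch of the same idea (the values here are hypothetical):

```python
import numpy as np
import pandas as pd

birth_year = pd.Series([np.nan, 1960.0, np.nan], name='Birth Year')
age_bin = pd.Series([1, 3, 4], name='Age Bin')

# Same bin-to-midpoint-year mapping idea as above
age_bin_to_year = {1: 2002, 2: 1979, 3: 1963, 4: 1937}

# Map bins to midpoint years, then fill only the missing Birth Years;
# the non-missing entry (index 1) is left untouched
filled = birth_year.fillna(age_bin.map(age_bin_to_year))

print(filled.tolist())  # [2002.0, 1960.0, 1937.0]
```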
In [84]:
# Recompute missing values

missing = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(gen_pop),2)
missing.sort_values('total missing', ascending = False, inplace = True)
missing
Out[84]:
total missing percent missing
HH: Probability of Children in Residence 5441 0.01
Bldg: Number of HHs 286 0.00
PLZ8: Number of Cars 275 0.00
Bldg: Distance to Point of Sale Category 25 0.00
RR4: Life Stage Type - HighRes__9A 0 0.00
RR4: Life Stage Type - HighRes__8D 0 0.00
RR4: Life Stage Type - HighRes__4B 0 0.00
RR4: Life Stage Type - HighRes__9C 0 0.00
RR4: Life Stage Type - HighRes__4C 0 0.00
RR4: Life Stage Type - HighRes__9B 0 0.00
RR4: Life Stage Type - HighRes__4D 0 0.00
RR4: Life Stage Type - HighRes__4E 0 0.00
RR4: Life Stage Type - HighRes__5A 0 0.00
RR4: Life Stage Type - HighRes__5B 0 0.00
RR4: Life Stage Type - HighRes__5C 0 0.00
RR4: Life Stage Type - HighRes__5D 0 0.00
RR4: Life Stage Type - HighRes__5E 0 0.00
RR4: Life Stage Type - HighRes__5F 0 0.00
RR4: Life Stage Type - HighRes__6A 0 0.00
RR4: Life Stage Type - HighRes__6B 0 0.00
RR4: Life Stage Type - HighRes__6C 0 0.00
RR4: Life Stage Type - HighRes__3D 0 0.00
RR4: Life Stage Type - HighRes__6D 0 0.00
RR4: Life Stage Type - HighRes__6E 0 0.00
RR4: Life Stage Type - HighRes__6F 0 0.00
RR4: Life Stage Type - HighRes__7A 0 0.00
RR4: Life Stage Type - HighRes__7B 0 0.00
RR4: Life Stage Type - HighRes__7C 0 0.00
RR4: Life Stage Type - HighRes__7D 0 0.00
RR4: Life Stage Type - HighRes__7E 0 0.00
RR4: Life Stage Type - HighRes__8A 0 0.00
RR4: Life Stage Type - HighRes__8B 0 0.00
RR4: Life Stage Type - HighRes__8C 0 0.00
RR4: Life Stage Type - HighRes__4A 0 0.00
Age Bin 0 0.00
RR4: Life Stage Type - HighRes__9E 0 0.00
Vacation Habits__9.0 0 0.00
Family Type - HighRes__7.0 0 0.00
Family Type - HighRes__8.0 0 0.00
Family Type - HighRes__9.0 0 0.00
Family Type - HighRes__10.0 0 0.00
Family Type - HighRes__11.0 0 0.00
... ... ...
Socioeconomic Status - HighRes__3.0 0 0.00
Socioeconomic Status - HighRes__4.0 0 0.00
Socioeconomic Status - HighRes__5.0 0 0.00
Socioeconomic Status - HighRes__6.0 0 0.00
Socioeconomic Status - HighRes__7.0 0 0.00
Socioeconomic Status - HighRes__8.0 0 0.00
Socioeconomic Status - HighRes__9.0 0 0.00
Socioeconomic Status - HighRes__10.0 0 0.00
Socioeconomic Status - HighRes__nan 0 0.00
Gender 0 0.00
Consumption Channel Type__nan 0 0.00
Consumption Channel Type__5.0 0 0.00
Shopper Type__1.0 0 0.00
Consumption Channel Type__4.0 0 0.00
Shopper Type__2.0 0 0.00
Shopper Type__3.0 0 0.00
Shopper Type__nan 0 0.00
Socioeconomic Status - LowRes__2.0 0 0.00
Socioeconomic Status - LowRes__3.0 0 0.00
Socioeconomic Status - LowRes__4.0 0 0.00
Socioeconomic Status - LowRes__5.0 0 0.00
Socioeconomic Status - LowRes__nan 0 0.00
Family Type - LowRes__2.0 0 0.00
Family Type - LowRes__3.0 0 0.00
Family Type - LowRes__4.0 0 0.00
Family Type - LowRes__5.0 0 0.00
Family Type - LowRes__nan 0 0.00
MoneyType__Primary__2.0 0 0.00
MoneyType__Primary__3.0 0 0.00
MoneyType__Primary__4.0 0 0.00
MoneyType__Primary__5.0 0 0.00
MoneyType__Primary__6.0 0 0.00
MoneyType__Primary__nan 0 0.00
Energy Consumption Type__2.0 0 0.00
Energy Consumption Type__3.0 0 0.00
Energy Consumption Type__4.0 0 0.00
Energy Consumption Type__5.0 0 0.00
Energy Consumption Type__6.0 0 0.00
Energy Consumption Type__nan 0 0.00
Consumption Channel Type__2.0 0 0.00
Consumption Channel Type__3.0 0 0.00
PLZ8: Primarily Business Bldgs 0 0.00

253 rows × 2 columns

In [85]:
# Drop null rows for HH: Probability of Children in Residence
gen_pop.dropna(subset=['HH: Probability of Children in Residence'],
              inplace=True)
In [86]:
# Recompute missing values

missing = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(gen_pop),2)
missing.sort_values('total missing', ascending = False, inplace = True)
missing
Out[86]:
total missing percent missing
Bldg: Number of HHs 286 0.0
PLZ8: Number of Cars 275 0.0
Bldg: Distance to Point of Sale Category 25 0.0
RR4: Life Stage Type - HighRes__4A 0 0.0
RR4: Life Stage Type - HighRes__4B 0 0.0
RR4: Life Stage Type - HighRes__4C 0 0.0
RR4: Life Stage Type - HighRes__4D 0 0.0
RR4: Life Stage Type - HighRes__4E 0 0.0
RR4: Life Stage Type - HighRes__6C 0 0.0
RR4: Life Stage Type - HighRes__5B 0 0.0
RR4: Life Stage Type - HighRes__5C 0 0.0
RR4: Life Stage Type - HighRes__5D 0 0.0
RR4: Life Stage Type - HighRes__5E 0 0.0
RR4: Life Stage Type - HighRes__5F 0 0.0
RR4: Life Stage Type - HighRes__6A 0 0.0
RR4: Life Stage Type - HighRes__6B 0 0.0
RR4: Life Stage Type - HighRes__5A 0 0.0
Age Bin 0 0.0
RR4: Life Stage Type - HighRes__6D 0 0.0
RR4: Life Stage Type - HighRes__6E 0 0.0
RR4: Life Stage Type - HighRes__6F 0 0.0
RR4: Life Stage Type - HighRes__7A 0 0.0
RR4: Life Stage Type - HighRes__7B 0 0.0
RR4: Life Stage Type - HighRes__7C 0 0.0
RR4: Life Stage Type - HighRes__7D 0 0.0
RR4: Life Stage Type - HighRes__7E 0 0.0
RR4: Life Stage Type - HighRes__8A 0 0.0
RR4: Life Stage Type - HighRes__8B 0 0.0
RR4: Life Stage Type - HighRes__8C 0 0.0
RR4: Life Stage Type - HighRes__8D 0 0.0
RR4: Life Stage Type - HighRes__9A 0 0.0
RR4: Life Stage Type - HighRes__9B 0 0.0
RR4: Life Stage Type - HighRes__3D 0 0.0
RR4: Life Stage Type - HighRes__3C 0 0.0
RR4: Life Stage Type - HighRes__9D 0 0.0
Vacation Habits__8.0 0 0.0
Family Type - HighRes__6.0 0 0.0
Family Type - HighRes__7.0 0 0.0
Family Type - HighRes__8.0 0 0.0
Family Type - HighRes__9.0 0 0.0
Family Type - HighRes__10.0 0 0.0
Family Type - HighRes__11.0 0 0.0
... ... ...
Socioeconomic Status - HighRes__2.0 0 0.0
Socioeconomic Status - HighRes__3.0 0 0.0
Socioeconomic Status - HighRes__4.0 0 0.0
Socioeconomic Status - HighRes__5.0 0 0.0
Socioeconomic Status - HighRes__6.0 0 0.0
Socioeconomic Status - HighRes__7.0 0 0.0
Socioeconomic Status - HighRes__8.0 0 0.0
Socioeconomic Status - HighRes__9.0 0 0.0
Socioeconomic Status - HighRes__10.0 0 0.0
Socioeconomic Status - HighRes__nan 0 0.0
Consumption Channel Type__6.0 0 0.0
Consumption Channel Type__4.0 0 0.0
Nationality Based on Name__nan 0 0.0
Consumption Channel Type__3.0 0 0.0
Shopper Type__1.0 0 0.0
Shopper Type__2.0 0 0.0
Shopper Type__3.0 0 0.0
Shopper Type__nan 0 0.0
Socioeconomic Status - LowRes__2.0 0 0.0
Socioeconomic Status - LowRes__3.0 0 0.0
Socioeconomic Status - LowRes__4.0 0 0.0
Socioeconomic Status - LowRes__5.0 0 0.0
Socioeconomic Status - LowRes__nan 0 0.0
Family Type - LowRes__2.0 0 0.0
Family Type - LowRes__3.0 0 0.0
Family Type - LowRes__4.0 0 0.0
Family Type - LowRes__5.0 0 0.0
Family Type - LowRes__nan 0 0.0
MoneyType__Primary__2.0 0 0.0
MoneyType__Primary__3.0 0 0.0
MoneyType__Primary__4.0 0 0.0
MoneyType__Primary__5.0 0 0.0
MoneyType__Primary__6.0 0 0.0
MoneyType__Primary__nan 0 0.0
Energy Consumption Type__2.0 0 0.0
Energy Consumption Type__3.0 0 0.0
Energy Consumption Type__4.0 0 0.0
Energy Consumption Type__5.0 0 0.0
Energy Consumption Type__6.0 0 0.0
Energy Consumption Type__nan 0 0.0
Consumption Channel Type__2.0 0 0.0
PLZ8: Primarily Business Bldgs 0 0.0

253 rows × 2 columns

In [87]:
# Bldg: Number of HHs and PLZ8: Number of Cars - Impute using the median
from sklearn.impute import SimpleImputer

imp_median = SimpleImputer(strategy='median')

gen_pop[['Bldg: Number of HHs', 'PLZ8: Number of Cars']] = \
imp_median.fit_transform(gen_pop[['Bldg: Number of HHs', 
                                  'PLZ8: Number of Cars']])
In [88]:
# Recompute missing values

missing = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(gen_pop),2)
missing.sort_values('total missing', ascending = False, inplace = True)
missing
Out[88]:
total missing percent missing
Bldg: Distance to Point of Sale Category 25 0.0
RR4: Life Stage Type - HighRes__9D 0 0.0
RR4: Life Stage Type - HighRes__3D 0 0.0
RR4: Life Stage Type - HighRes__4A 0 0.0
RR4: Life Stage Type - HighRes__4B 0 0.0
RR4: Life Stage Type - HighRes__4C 0 0.0
RR4: Life Stage Type - HighRes__4D 0 0.0
RR4: Life Stage Type - HighRes__4E 0 0.0
RR4: Life Stage Type - HighRes__5A 0 0.0
RR4: Life Stage Type - HighRes__5B 0 0.0
RR4: Life Stage Type - HighRes__5C 0 0.0
RR4: Life Stage Type - HighRes__5D 0 0.0
RR4: Life Stage Type - HighRes__5E 0 0.0
RR4: Life Stage Type - HighRes__5F 0 0.0
RR4: Life Stage Type - HighRes__6A 0 0.0
RR4: Life Stage Type - HighRes__6B 0 0.0
Age Bin 0 0.0
RR4: Life Stage Type - HighRes__6D 0 0.0
RR4: Life Stage Type - HighRes__6E 0 0.0
RR4: Life Stage Type - HighRes__6F 0 0.0
RR4: Life Stage Type - HighRes__7A 0 0.0
RR4: Life Stage Type - HighRes__7B 0 0.0
RR4: Life Stage Type - HighRes__7C 0 0.0
RR4: Life Stage Type - HighRes__7D 0 0.0
RR4: Life Stage Type - HighRes__7E 0 0.0
RR4: Life Stage Type - HighRes__8A 0 0.0
RR4: Life Stage Type - HighRes__8B 0 0.0
RR4: Life Stage Type - HighRes__8C 0 0.0
RR4: Life Stage Type - HighRes__8D 0 0.0
RR4: Life Stage Type - HighRes__9A 0 0.0
RR4: Life Stage Type - HighRes__9B 0 0.0
RR4: Life Stage Type - HighRes__6C 0 0.0
RR4: Life Stage Type - HighRes__3C 0 0.0
RR4: Life Stage Type - HighRes__3B 0 0.0
Vacation Habits__7.0 0 0.0
Family Type - HighRes__5.0 0 0.0
Family Type - HighRes__6.0 0 0.0
Family Type - HighRes__7.0 0 0.0
Family Type - HighRes__8.0 0 0.0
Family Type - HighRes__9.0 0 0.0
Family Type - HighRes__10.0 0 0.0
Family Type - HighRes__11.0 0 0.0
... ... ...
RR4: Life Stage Type - LowRes__nan 0 0.0
Socioeconomic Status - HighRes__2.0 0 0.0
Socioeconomic Status - HighRes__3.0 0 0.0
Socioeconomic Status - HighRes__4.0 0 0.0
Socioeconomic Status - HighRes__5.0 0 0.0
Socioeconomic Status - HighRes__6.0 0 0.0
Socioeconomic Status - HighRes__7.0 0 0.0
Socioeconomic Status - HighRes__8.0 0 0.0
Socioeconomic Status - HighRes__9.0 0 0.0
Socioeconomic Status - HighRes__10.0 0 0.0
Consumption Channel Type__4.0 0 0.0
Consumption Channel Type__3.0 0 0.0
Consumption Channel Type__2.0 0 0.0
Energy Consumption Type__nan 0 0.0
Nationality Based on Name__3.0 0 0.0
Nationality Based on Name__nan 0 0.0
Shopper Type__1.0 0 0.0
Shopper Type__2.0 0 0.0
Shopper Type__3.0 0 0.0
Shopper Type__nan 0 0.0
Socioeconomic Status - LowRes__2.0 0 0.0
Socioeconomic Status - LowRes__3.0 0 0.0
Socioeconomic Status - LowRes__4.0 0 0.0
Socioeconomic Status - LowRes__5.0 0 0.0
Socioeconomic Status - LowRes__nan 0 0.0
Family Type - LowRes__2.0 0 0.0
Family Type - LowRes__3.0 0 0.0
Family Type - LowRes__4.0 0 0.0
Family Type - LowRes__5.0 0 0.0
Family Type - LowRes__nan 0 0.0
MoneyType__Primary__2.0 0 0.0
MoneyType__Primary__3.0 0 0.0
MoneyType__Primary__4.0 0 0.0
MoneyType__Primary__5.0 0 0.0
MoneyType__Primary__6.0 0 0.0
MoneyType__Primary__nan 0 0.0
Energy Consumption Type__2.0 0 0.0
Energy Consumption Type__3.0 0 0.0
Energy Consumption Type__4.0 0 0.0
Energy Consumption Type__5.0 0 0.0
Energy Consumption Type__6.0 0 0.0
PLZ8: Primarily Business Bldgs 0 0.0

253 rows × 2 columns

In [89]:
# Bldg: Distance to Point of Sale Category - Impute using mode
from sklearn.impute import SimpleImputer

imp_mode = SimpleImputer(strategy='most_frequent')

gen_pop['Bldg: Distance to Point of Sale Category'] = \
    imp_mode.fit_transform(gen_pop['Bldg: Distance to Point of Sale Category'].values.reshape(-1, 1))
In [90]:
# Recompute missing values

missing = pd.DataFrame(gen_pop.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(gen_pop),2)
missing.sort_values('total missing', ascending = False, inplace = True)
missing
Out[90]:
total missing percent missing
Age Bin 0 0.0
RR4: Life Stage Type - HighRes__6C 0 0.0
RR4: Life Stage Type - HighRes__4A 0 0.0
RR4: Life Stage Type - HighRes__4B 0 0.0
RR4: Life Stage Type - HighRes__4C 0 0.0
RR4: Life Stage Type - HighRes__4D 0 0.0
RR4: Life Stage Type - HighRes__4E 0 0.0
RR4: Life Stage Type - HighRes__5A 0 0.0
RR4: Life Stage Type - HighRes__5B 0 0.0
RR4: Life Stage Type - HighRes__5C 0 0.0
RR4: Life Stage Type - HighRes__5D 0 0.0
RR4: Life Stage Type - HighRes__5E 0 0.0
RR4: Life Stage Type - HighRes__5F 0 0.0
RR4: Life Stage Type - HighRes__6A 0 0.0
RR4: Life Stage Type - HighRes__6B 0 0.0
RR4: Life Stage Type - HighRes__6D 0 0.0
RR4: Life Stage Type - HighRes__3C 0 0.0
RR4: Life Stage Type - HighRes__6E 0 0.0
RR4: Life Stage Type - HighRes__6F 0 0.0
RR4: Life Stage Type - HighRes__7A 0 0.0
RR4: Life Stage Type - HighRes__7B 0 0.0
RR4: Life Stage Type - HighRes__7C 0 0.0
RR4: Life Stage Type - HighRes__7D 0 0.0
RR4: Life Stage Type - HighRes__7E 0 0.0
RR4: Life Stage Type - HighRes__8A 0 0.0
RR4: Life Stage Type - HighRes__8B 0 0.0
RR4: Life Stage Type - HighRes__8C 0 0.0
RR4: Life Stage Type - HighRes__8D 0 0.0
RR4: Life Stage Type - HighRes__9A 0 0.0
RR4: Life Stage Type - HighRes__9B 0 0.0
RR4: Life Stage Type - HighRes__3D 0 0.0
RR4: Life Stage Type - HighRes__3B 0 0.0
Family Type - HighRes__3.0 0 0.0
Vacation Habits__7.0 0 0.0
Family Type - HighRes__5.0 0 0.0
Family Type - HighRes__6.0 0 0.0
Family Type - HighRes__7.0 0 0.0
Family Type - HighRes__8.0 0 0.0
Family Type - HighRes__9.0 0 0.0
Family Type - HighRes__10.0 0 0.0
Family Type - HighRes__11.0 0 0.0
Family Type - HighRes__nan 0 0.0
... ... ...
RR4: Life Stage Type - LowRes__nan 0 0.0
Socioeconomic Status - HighRes__2.0 0 0.0
Socioeconomic Status - HighRes__3.0 0 0.0
Socioeconomic Status - HighRes__4.0 0 0.0
Socioeconomic Status - HighRes__5.0 0 0.0
Socioeconomic Status - HighRes__6.0 0 0.0
Socioeconomic Status - HighRes__7.0 0 0.0
Socioeconomic Status - HighRes__8.0 0 0.0
Socioeconomic Status - HighRes__9.0 0 0.0
Socioeconomic Status - HighRes__10.0 0 0.0
Consumption Channel Type__4.0 0 0.0
Consumption Channel Type__3.0 0 0.0
Consumption Channel Type__2.0 0 0.0
Energy Consumption Type__nan 0 0.0
Nationality Based on Name__3.0 0 0.0
Nationality Based on Name__nan 0 0.0
Shopper Type__1.0 0 0.0
Shopper Type__2.0 0 0.0
Shopper Type__3.0 0 0.0
Shopper Type__nan 0 0.0
Socioeconomic Status - LowRes__2.0 0 0.0
Socioeconomic Status - LowRes__3.0 0 0.0
Socioeconomic Status - LowRes__4.0 0 0.0
Socioeconomic Status - LowRes__5.0 0 0.0
Socioeconomic Status - LowRes__nan 0 0.0
Family Type - LowRes__2.0 0 0.0
Family Type - LowRes__3.0 0 0.0
Family Type - LowRes__4.0 0 0.0
Family Type - LowRes__5.0 0 0.0
Family Type - LowRes__nan 0 0.0
MoneyType__Primary__2.0 0 0.0
MoneyType__Primary__3.0 0 0.0
MoneyType__Primary__4.0 0 0.0
MoneyType__Primary__5.0 0 0.0
MoneyType__Primary__6.0 0 0.0
MoneyType__Primary__nan 0 0.0
Energy Consumption Type__2.0 0 0.0
Energy Consumption Type__3.0 0 0.0
Energy Consumption Type__4.0 0 0.0
Energy Consumption Type__5.0 0 0.0
Energy Consumption Type__6.0 0 0.0
PLZ8: Primarily Business Bldgs 0 0.0

253 rows × 2 columns

There you go! No more missing values!
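Since this count-and-percent table gets rebuilt by hand after every imputation step, the pattern is worth wrapping in a small helper. This is a sketch: the toy frame stands in for `gen_pop`.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages, worst offenders first."""
    report = pd.DataFrame({'total missing': df.isnull().sum()})
    report['percent missing'] = round(report['total missing'] / len(df), 2)
    return report.sort_values('total missing', ascending=False)

# Toy frame: column 'a' has one gap, column 'b' is complete
toy = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [4, 5, 6]})
print(missing_report(toy))
```

Calling `missing_report(gen_pop)` after each imputation step would then reproduce the tables above in one line.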

In [91]:
# Apply feature scaling to the general population demographics data.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
gen_pop_scaled = scaler.fit_transform(gen_pop)

scaler.n_samples_seen_
Out[91]:
623795
My Imputation Plan, Again
Feature Name % of Rows Missing Values Imputation Approach
Birth Year 37% Since Age Bin isn't missing any values, I'll use the midpoint of the bin for a given row as the Birth Year
HH: Probability of Children in Residence 1% Drop the rows wherein this is missing
Bldg: Number of HHs < 1% Impute using the median
PLZ8: Number of Cars < 1% Impute using the median
Bldg: Distance to Point of Sale Category < 1% Impute using mode

Discussion 2.1: Apply Feature Scaling

I found that there were still five features with missing values: Birth Year, HH: Probability of Children in Residence, Bldg: Number of HHs, PLZ8: Number of Cars, and Bldg: Distance to Point of Sale Category. Of these, only Birth Year was missing a significant share of values relative to the total number of rows (about 37%); the others were at 1% or lower.

I used a variety of imputation techniques: deriving likely values from a highly related feature with no missing values (Age Bin for Birth Year); filling with univariate constants such as the median or the mode; or dropping the affected rows outright. I've copied my imputation plan in a table above this cell as a reminder of what was specifically done for each feature in question.

After completing imputation and having a dataset with no missing values as a result, I standardized all features so that they would have zero mean and a variance of 1.
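The plan in the table can be sketched end to end as follows. This is a hedged outline, not the exact notebook code: `age_bin_midpoints` is a hypothetical lookup (the real bin-to-midpoint mapping comes from the data dictionary), and a toy frame stands in for `gen_pop`.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for gen_pop
df = pd.DataFrame({
    'Age Bin': [1, 2, 2, 3],
    'Birth Year': [1950.0, None, 1975.0, None],
    'HH: Probability of Children in Residence': [0.1, None, 0.3, 0.4],
    'Bldg: Number of HHs': [2.0, 3.0, None, 5.0],
})
# Hypothetical bin -> midpoint mapping from the data dictionary
age_bin_midpoints = {1: 1950, 2: 1975, 3: 1995}

# Birth Year: fill from the midpoint of the row's Age Bin
df['Birth Year'] = df['Birth Year'].fillna(df['Age Bin'].map(age_bin_midpoints))

# HH: Probability of Children in Residence: drop rows where it is missing
df = df.dropna(subset=['HH: Probability of Children in Residence'])

# Bldg: Number of HHs (and similarly PLZ8: Number of Cars): impute with the median
imp_median = SimpleImputer(strategy='median')
df[['Bldg: Number of HHs']] = imp_median.fit_transform(df[['Bldg: Number of HHs']])
```

The mode imputation for Bldg: Distance to Point of Sale Category follows the same `SimpleImputer` pattern with `strategy='most_frequent'`, as shown in the cell above.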

Step 2.2: Perform Dimensionality Reduction

On your scaled data, you are now ready to apply dimensionality reduction techniques.

  • Use sklearn's PCA class to apply principal component analysis to the data, thus finding the vectors of maximal variance in the data. To start, either leave the parameters unset (so all components are computed) or set a number of components that is at least half the number of features (so there are enough components to see the general trend in variability).
  • Check out the ratio of variance explained by each principal component as well as the cumulative variance explained. Try plotting the cumulative or sequential values using matplotlib's plot() function. Based on what you find, select a value for the number of transformed features you'll retain for the clustering part of the project.
  • Once you've made a choice for the number of components to keep, make sure you re-fit a PCA instance to perform the decided-on transformation.
In [92]:
# Apply PCA to the data.

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(gen_pop_scaled)
Out[92]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [93]:
# Investigate the variance accounted for by each principal component.

exp_var = pd.DataFrame(pca.explained_variance_ratio_, columns=['Explained Variance Ratio'])
exp_var.index.name = 'Principal Component'
exp_var['CumSum of Explained Variance Ratio'] = exp_var['Explained Variance Ratio'].cumsum()

exp_var
Out[93]:
Explained Variance Ratio CumSum of Explained Variance Ratio
Principal Component
0 6.818192e-02 0.068182
1 5.393458e-02 0.122116
2 2.960259e-02 0.151719
3 2.531159e-02 0.177031
4 2.008917e-02 0.197120
5 1.836754e-02 0.215487
6 1.733394e-02 0.232821
7 1.722199e-02 0.250043
8 1.643388e-02 0.266477
9 1.507338e-02 0.281551
10 1.127947e-02 0.292830
11 1.084462e-02 0.303675
12 1.068658e-02 0.314361
13 1.064920e-02 0.325010
14 9.802701e-03 0.334813
15 9.649799e-03 0.344463
16 9.531906e-03 0.353995
17 9.377897e-03 0.363373
18 9.235464e-03 0.372608
19 9.107859e-03 0.381716
20 9.075490e-03 0.390792
21 8.933953e-03 0.399726
22 8.845783e-03 0.408571
23 8.612122e-03 0.417183
24 8.456718e-03 0.425640
25 8.019449e-03 0.433660
26 7.702830e-03 0.441362
27 7.471530e-03 0.448834
28 6.996428e-03 0.455830
29 6.827045e-03 0.462657
30 6.552505e-03 0.469210
31 6.461755e-03 0.475672
32 6.171573e-03 0.481843
33 6.012238e-03 0.487855
34 5.923568e-03 0.493779
35 5.849192e-03 0.499628
36 5.666632e-03 0.505295
37 5.560349e-03 0.510855
38 5.443149e-03 0.516298
39 5.378249e-03 0.521677
40 5.361771e-03 0.527038
41 5.304487e-03 0.532343
... ... ...
211 9.126992e-33 1.000000
212 8.943457e-33 1.000000
213 8.464855e-33 1.000000
214 6.703384e-33 1.000000
215 5.759712e-33 1.000000
216 5.556222e-33 1.000000
217 5.163758e-33 1.000000
218 5.042528e-33 1.000000
219 5.015657e-33 1.000000
220 4.137585e-33 1.000000
221 3.973910e-33 1.000000
222 3.945219e-33 1.000000
223 3.443030e-33 1.000000
224 3.306794e-33 1.000000
225 3.126750e-33 1.000000
226 2.858518e-33 1.000000
227 2.402928e-33 1.000000
228 1.924440e-33 1.000000
229 1.619683e-33 1.000000
230 1.508341e-33 1.000000
231 1.234064e-33 1.000000
232 9.108843e-34 1.000000
233 6.484490e-34 1.000000
234 5.131683e-34 1.000000
235 4.561248e-34 1.000000
236 4.285612e-34 1.000000
237 4.243823e-34 1.000000
238 4.075687e-34 1.000000
239 4.075687e-34 1.000000
240 4.075687e-34 1.000000
241 4.075687e-34 1.000000
242 4.075687e-34 1.000000
243 4.075687e-34 1.000000
244 4.075687e-34 1.000000
245 4.075687e-34 1.000000
246 4.075687e-34 1.000000
247 4.075687e-34 1.000000
248 3.371051e-34 1.000000
249 2.026422e-34 1.000000
250 1.062580e-34 1.000000
251 3.963572e-35 1.000000
252 8.507607e-36 1.000000

253 rows × 2 columns

In [94]:
# Plot explained variance ratio as a function of PC,
# with a lineplot of the cumulative sum on a separate y-axis

fig, ax1 = plt.subplots()

# Explained variance ratio sequence
sns.lineplot(x=exp_var.index, y=exp_var['Explained Variance Ratio'], ax=ax1)
ax1.set_xlabel('Principal Component Number')
ax1.set(ylabel='Explained Variance Ratio')

# Make them share an x-axis but have decoupled y-axes
ax2 = ax1.twinx()

# Explained variance ratio sequence
sns.lineplot(x=exp_var.index, y=exp_var['CumSum of Explained Variance Ratio'], 
             color='red', ax=ax2)
ax2.set(ylabel='Cumulative Explained Variance Ratio')
Out[94]:
[Text(0, 0.5, 'Cumulative Explained Variance Ratio')]

Let's zoom in on that elbow in the cumulative plot...

In [95]:
exp_var.loc[125:175]
Out[95]:
Explained Variance Ratio CumSum of Explained Variance Ratio
Principal Component
125 0.003555 0.896109
126 0.003511 0.899619
127 0.003462 0.903081
128 0.003348 0.906429
129 0.003299 0.909729
130 0.003240 0.912968
131 0.003197 0.916165
132 0.003150 0.919316
133 0.003053 0.922369
134 0.003005 0.925373
135 0.002851 0.928224
136 0.002821 0.931046
137 0.002775 0.933820
138 0.002644 0.936464
139 0.002543 0.939007
140 0.002440 0.941447
141 0.002408 0.943856
142 0.002324 0.946180
143 0.002216 0.948396
144 0.002213 0.950609
145 0.002106 0.952715
146 0.002022 0.954737
147 0.001981 0.956718
148 0.001882 0.958600
149 0.001849 0.960449
150 0.001785 0.962233
151 0.001712 0.963945
152 0.001703 0.965648
153 0.001662 0.967310
154 0.001628 0.968939
155 0.001591 0.970530
156 0.001546 0.972076
157 0.001306 0.973382
158 0.001286 0.974668
159 0.001201 0.975869
160 0.001189 0.977059
161 0.001128 0.978187
162 0.001106 0.979293
163 0.001070 0.980363
164 0.001025 0.981388
165 0.000969 0.982357
166 0.000933 0.983290
167 0.000912 0.984202
168 0.000869 0.985070
169 0.000835 0.985905
170 0.000827 0.986732
171 0.000786 0.987518
172 0.000738 0.988256
173 0.000727 0.988983
174 0.000697 0.989680
175 0.000672 0.990351

It looks like we get a cumulative explained variance ratio of 90% with 128 principal components (PCs), 95% with 145 principal components, and 99% with 176 principal components. There's clearly plenty of room here to do dimensionality reduction!

As it looks like the inflection point/elbow of the cumulative explained variance curve is roughly at 140 PCs, let's retain 145 of the PCs to get a nice round(ish) 95% of the explained variance captured, meaning we're only using roughly 57% of the total feature set now. That's a pretty good reduction!
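The 145-component choice can also be made programmatically: take the smallest number of components whose cumulative explained variance ratio crosses the target. A sketch on synthetic data (the 95% threshold is the only assumption carried over from above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Correlated synthetic features so that variance concentrates in a few PCs
X = rng.normal(size=(500, 30)) @ rng.normal(size=(30, 30))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 95%
k = int(np.argmax(cumvar >= 0.95)) + 1
print(k, cumvar[k - 1])
```

Note that scikit-learn can do this directly: passing a float in (0, 1) as `PCA(n_components=0.95)` keeps the smallest number of components that explains at least that fraction of the variance.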

In [96]:
# Re-apply PCA to the data with 145 components retained

pca = PCA(n_components=145)
gen_pop_scaled_pca = pca.fit_transform(gen_pop_scaled)
In [97]:
# How much variance is now being explained?
exp_var = pd.DataFrame(pca.explained_variance_ratio_, columns=['Explained Variance Ratio'])
exp_var.index.name = 'Principal Component'
exp_var['CumSum of Explained Variance Ratio'] = exp_var['Explained Variance Ratio'].cumsum()

exp_var
Out[97]:
Explained Variance Ratio CumSum of Explained Variance Ratio
Principal Component
0 0.068182 0.068182
1 0.053935 0.122116
2 0.029603 0.151719
3 0.025312 0.177031
4 0.020089 0.197120
5 0.018368 0.215487
6 0.017334 0.232821
7 0.017222 0.250043
8 0.016434 0.266477
9 0.015073 0.281551
10 0.011279 0.292830
11 0.010845 0.303675
12 0.010687 0.314361
13 0.010649 0.325010
14 0.009803 0.334813
15 0.009650 0.344463
16 0.009532 0.353995
17 0.009378 0.363373
18 0.009235 0.372608
19 0.009108 0.381716
20 0.009075 0.390792
21 0.008934 0.399726
22 0.008846 0.408571
23 0.008612 0.417183
24 0.008457 0.425640
25 0.008019 0.433660
26 0.007703 0.441362
27 0.007472 0.448834
28 0.006996 0.455830
29 0.006827 0.462657
30 0.006552 0.469210
31 0.006462 0.475672
32 0.006172 0.481843
33 0.006012 0.487855
34 0.005924 0.493779
35 0.005849 0.499628
36 0.005667 0.505295
37 0.005560 0.510855
38 0.005443 0.516298
39 0.005378 0.521676
40 0.005362 0.527038
41 0.005304 0.532343
... ... ...
103 0.004223 0.808582
104 0.004217 0.812799
105 0.004212 0.817011
106 0.004210 0.821221
107 0.004196 0.825417
108 0.004188 0.829606
109 0.004169 0.833774
110 0.004151 0.837925
111 0.004125 0.842050
112 0.004115 0.846165
113 0.004100 0.850265
114 0.004053 0.854318
115 0.004014 0.858332
116 0.003971 0.862303
117 0.003936 0.866239
118 0.003897 0.870136
119 0.003856 0.873992
120 0.003783 0.877775
121 0.003743 0.881518
122 0.003722 0.885240
123 0.003660 0.888900
124 0.003637 0.892537
125 0.003555 0.896092
126 0.003510 0.899602
127 0.003461 0.903063
128 0.003347 0.906410
129 0.003298 0.909708
130 0.003237 0.912945
131 0.003197 0.916142
132 0.003150 0.919291
133 0.003048 0.922339
134 0.003002 0.925341
135 0.002848 0.928189
136 0.002820 0.931009
137 0.002768 0.933777
138 0.002640 0.936416
139 0.002531 0.938948
140 0.002437 0.941385
141 0.002399 0.943784
142 0.002305 0.946089
143 0.002211 0.948300
144 0.002207 0.950507

145 rows × 2 columns

Discussion 2.2: Perform Dimensionality Reduction

I found that the growth in cumulative explained variance slowed noticeably around 140 components, so I used this inflection point as a guide and selected 145 components as my target for dimensionality reduction: they explain 95% of the variance of the dataset with only 57% of the original feature count.

Step 2.3: Interpret Principal Components

Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.

As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component points in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one tend to be associated with increases in the other. In contrast, features with weights of opposite signs can be expected to show a negative correlation: increases in one variable tend to be associated with decreases in the other.
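This sign rule can be seen on a toy example: two features that move together pick up first-component weights of the same sign, while an unrelated feature gets a weight near zero. A small sketch on synthetic data (not the project dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
shared = rng.normal(size=500)
X = np.column_stack([
    shared + 0.1 * rng.normal(size=500),   # feature A
    shared + 0.1 * rng.normal(size=500),   # feature B, rises with A
    rng.normal(size=500),                  # feature C, independent
])

# Weights of the first principal component
w = PCA(n_components=1).fit(X).components_[0]
# A and B carry large weights of the same sign; C's weight is comparatively tiny
print(w)
```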

  • To investigate the features, you should map each weight to its corresponding feature name, then sort the features according to weight. The most interesting features for each principal component, then, will be those at the beginning and end of the sorted list. Use the data dictionary document to help you understand these most prominent features, their relationships, and what a positive or negative value on the principal component might indicate.
  • You should investigate and interpret feature associations from the first three principal components in this substep. To help facilitate this, you should write a function that you can call at any time to print the sorted list of feature weights for the i-th principal component. This might come in handy in the next step of the project, when you interpret the tendencies of the discovered clusters.
In [98]:
# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
# HINT: Try defining a function here or in a new cell that you can reuse in the
# other cells.

def get_pca_feature_weights(pca, PC, plot=False):
    '''
    Takes a fitted PCA object and a principal component number and
    returns the weights of the original named features that comprise
    that principal component
    
    Inputs
    ------
    pca: fitted sklearn PCA object
    
    PC: int. Index of the principal component (PC) for which you want weights
        by named original feature
        
    plot: bool. If True, plots out distribution of weights with vertical line
        at x=0 and barplot of weights by feature
        
        
    Outputs
    -------
    pandas Series with original feature names as index and weights as values
    '''
    
    # Note: feature names are taken from the columns of the global gen_pop DataFrame
    pca_weights = pd.DataFrame(data=pca.components_, columns=gen_pop.columns)
    pca_weights.index.name='Principal Components'
    
    out = pca_weights.loc[PC].sort_values(ascending=False)
    out.name = f"Principal Component {PC} Weights"
    
    
    if plot:
        fig, (ax_dist, ax_bar) = plt.subplots(nrows=2)
        #fig.suptitle(f"Distribution of Weights for Principal Component {PC}")
        
        # Distribution
        sns.distplot(out, ax=ax_dist)

        # Draw vertical line at x=0 to see how centered the distribution is
        ax_dist.axvline(x=0, color='red')
        
        # Barplot
        sns.barplot(x=out.index,
                    y=out.values,
                    ax=ax_bar)

        # Draw vertical line at midpoint of distribution
        ax_bar.axvline(x=len(pca_weights.columns)/2, color='red')
        ax_bar.set(ylabel='Component Weight',
                  xlabel='Feature Name')
        
        plt.tight_layout()
    
    return out
In [99]:
# Weights for the first principal component (PC)

get_pca_feature_weights(pca,0,plot=True)
Out[99]:
PLZ8: Bins of 6-10 Family Homes                               0.171248
HH: Net Income Bins                                           0.170490
RR4: Life Stage Type - Int'l - Wealth                         0.170421
PLZ8: Bins of 10+ Family Homes                                0.165857
Community: Size (bins)                                        0.147667
PostCode: Density of HHs per km^2 (bins)                      0.144281
MoneyType__Saver                                              0.141547
Bldg: Neighborhood Quality                                    0.129913
MoneyType__Homeowner                                          0.128061
RR3: Bins of 10+ Family Homes                                 0.117980
Socioeconomic Status - HighRes__2.0                           0.117858
Birth Year                                                    0.116782
PLZ8: Bins of 3-5 Family Homes                                0.114157
RR3: Bins of 6-10 Family Homes                                0.110058
Community: Share of Unemployment                              0.109469
Generation Decade                                             0.108680
PersonalityType__Dutiful                                      0.108287
PLZ8: Primarily Business Bldgs                                0.107616
Bldg: Number of HHs                                           0.106349
RR4: Life Stage Type - LowRes__9                              0.106140
PersonalityType__Religious                                    0.103381
Community: Share of Unemployment Relative to Parent County    0.099259
PersonalityType__Traditional                                  0.097044
PersonalityType__Rational                                     0.090969
RR4: Life Stage Type - LowRes__8                              0.084737
Energy Consumption Type__5.0                                  0.083778
MoneyType__Inconspicuous                                      0.080485
MoneyType__Investor                                           0.079222
PersonalityType__Materialist                                  0.077641
PersonalityType__Family                                       0.071707
Life Stage - HighRes__2.0                                     0.069611
PersonalityType__Cultured                                     0.069448
Vacation Habits__12.0                                         0.063132
RR1: Neighborhood Type                                        0.059504
Bldg: Building Type__3.0                                      0.059055
RR4: Life Stage Type - HighRes__8A                            0.058551
Life Stage - HighRes__29.0                                    0.055280
RR4: Life Stage Type - HighRes__9B                            0.055181
RR4: Life Stage Type - HighRes__9C                            0.054632
Nationality Based on Name__2.0                                0.052917
RR4: Life Stage Type - HighRes__9D                            0.052558
Life Stage - HighRes__5.0                                     0.051004
                                                                ...   
Bldg: Location Relative to E or W Germany__W                 -0.046633
Vacation Habits__7.0                                         -0.047496
Energy Consumption Type__3.0                                 -0.047500
RR4: Life Stage Type - HighRes__2D                           -0.048354
Life Stage - HighRes__12.0                                   -0.049159
Life Stage - HighRes__20.0                                   -0.052309
Life Stage - HighRes__40.0                                   -0.054070
Life Stage - HighRes__13.0                                   -0.054211
Family Type - LowRes__5.0                                    -0.055398
HH: Length of Residency (bins)                               -0.059939
Consumption Channel Type__2.0                                -0.060020
PLZ8: Number of Cars                                         -0.060781
RR4: Life Stage Type - LowRes__3                             -0.061971
RR4: Life Stage Type - LowRes__4                             -0.068948
HH: Number of Adults in HH                                   -0.069922
Life Stage - LowRes__5.0                                     -0.071939
Life Stage - LowRes__3.0                                     -0.078880
PersonalityType__Event                                       -0.080459
RR4: Life Stage Type - LowRes__2                             -0.080999
PersonalityType__Sensual                                     -0.086933
RR1: Residential-Commercial Activity Ratio (categories)      -0.089841
PostCode: Distance to Nearest Urban Center (bins)            -0.089927
MoneyType__Primary__2.0                                      -0.090646
Life Stage - LowRes__12.0                                    -0.100729
Generation Movement                                          -0.100874
Young Environmentalist                                       -0.100874
Bldg: Rural Type__1.0                                        -0.105591
Socioeconomic Status - HighRes__9.0                          -0.106210
Socioeconomic Status - LowRes__4.0                           -0.107124
RR4: Life Stage Type - Int'l - Stage                         -0.109759
MoneyType__Preparer                                          -0.113756
Socioeconomic Status - HighRes__10.0                         -0.117575
Socioeconomic Status - LowRes__5.0                           -0.117575
Age Bin                                                      -0.119983
PostCode: Distance to City Center (bins)                     -0.120492
Bldg: Distance to Point of Sale Category                     -0.125667
PLZ8: Bin Counts of Bldgs                                    -0.128118
RR3: Bin Counts of Bldgs                                     -0.170701
PLZ8: Bins of 1-2 Family Homes                               -0.172755
RR3: Bins of 1-2 Family Homes                                -0.181674
RR1: Movement Patterns                                       -0.193011
MoneyType__Minimalist                                        -0.195638
Name: Principal Component 0 Weights, Length: 253, dtype: float64
In [100]:
# Map weights for the second principal component to corresponding feature names
# and then print the linked values, sorted by weight.

get_pca_feature_weights(pca,1,plot=True)
Out[100]:
Age Bin                                                       0.202035
MoneyType__Preparer                                           0.191114
Energy Consumption Type__3.0                                  0.181836
PersonalityType__Event                                        0.154980
Life Stage - LowRes__2.0                                      0.152410
Customer Type                                                 0.147401
PersonalityType__Sensual                                      0.139062
HH: Probability of Children in Residence                      0.129633
Life Stage - HighRes__6.0                                     0.110166
MoneyType__Homeowner                                          0.106891
PLZ8: Bins of 6-10 Family Homes                               0.095074
MoneyType__Primary__5.0                                       0.093212
PLZ8: Bins of 10+ Family Homes                                0.090055
Consumption Channel Type__2.0                                 0.090050
Community: Size (bins)                                        0.088204
PostCode: Density of HHs per km^2 (bins)                      0.087766
RR4: Life Stage Type - Int'l - Wealth                         0.083055
Life Stage - HighRes__8.0                                     0.069806
PersonalityType__Critical                                     0.067870
Community: Share of Unemployment                              0.066987
Life Stage - HighRes__5.0                                     0.066760
RR3: Bins of 10+ Family Homes                                 0.066500
MoneyType__Primary__2.0                                       0.066228
Bldg: Neighborhood Quality                                    0.066050
PLZ8: Bins of 3-5 Family Homes                                0.064465
Community: Share of Unemployment Relative to Parent County    0.062034
Shopper Type__3.0                                             0.060812
Bldg: Number of HHs                                           0.060705
HH: Net Income Bins                                           0.059460
Life Stage - HighRes__15.0                                    0.055536
Vacation Habits__4.0                                          0.054654
MoneyType__Primary__6.0                                       0.053653
PLZ8: Primarily Business Bldgs                                0.053139
RR3: Bins of 6-10 Family Homes                                0.052951
Socioeconomic Status - HighRes__3.0                           0.052455
RR4: Life Stage Type - LowRes__8                              0.052317
Life Stage - LowRes__10.0                                     0.050976
PersonalityType__Combative                                    0.050391
MoneyType__Minimalist                                         0.049361
Life Stage - HighRes__31.0                                    0.044664
HH: Length of Residency (bins)                                0.041316
Life Stage - LowRes__4.0                                      0.040500
                                                                ...   
Consumption Channel Type__4.0                                -0.048240
Life Stage - LowRes__7.0                                     -0.048263
PersonalityType__Social                                      -0.048390
Life Stage - HighRes__3.0                                    -0.048809
Life Stage - HighRes__35.0                                   -0.049691
Life Stage - HighRes__34.0                                   -0.051624
Life Stage - LowRes__8.0                                     -0.051976
Family Type - LowRes__5.0                                    -0.053979
PostCode: Distance to Nearest Urban Center (bins)            -0.055399
Bldg: Rural Type__1.0                                        -0.064113
Vacation Habits__9.0                                         -0.067587
Bldg: Distance to Point of Sale Category                     -0.068433
Life Stage - HighRes__30.0                                   -0.069529
Energy Consumption Type__5.0                                 -0.069941
Socioeconomic Status - HighRes__2.0                          -0.069942
PostCode: Distance to City Center (bins)                     -0.070866
PLZ8: Bin Counts of Bldgs                                    -0.071807
Family Type - LowRes__4.0                                    -0.072034
HH: Number of Adults in HH                                   -0.075051
MoneyType__Primary__3.0                                      -0.076669
Socioeconomic Status - HighRes__5.0                          -0.077301
Life Stage - LowRes__11.0                                    -0.079323
Life Stage - LowRes__9.0                                     -0.082328
MoneyType__Primary__4.0                                      -0.087554
RR3: Bins of 1-2 Family Homes                                -0.089342
RR3: Bin Counts of Bldgs                                     -0.091597
PLZ8: Bins of 1-2 Family Homes                               -0.094101
RR1: Movement Patterns                                       -0.096638
Energy Consumption Type__4.0                                 -0.100684
PersonalityType__Materialist                                 -0.102170
PersonalityType__Family                                      -0.104828
PersonalityType__Cultured                                    -0.138950
PersonalityType__Rational                                    -0.147749
RR1: Online Affinity                                         -0.157013
PersonalityType__Traditional                                 -0.180297
MoneyType__Investor                                          -0.182484
PersonalityType__Dutiful                                     -0.183592
PersonalityType__Religious                                   -0.186685
MoneyType__Inconspicuous                                     -0.191798
MoneyType__Saver                                             -0.193066
Generation Decade                                            -0.213852
Birth Year                                                   -0.215110
Name: Principal Component 1 Weights, Length: 253, dtype: float64
In [136]:
# Map weights for the third principal component to corresponding feature names
# and then print the linked values, sorted by weight.

get_pca_feature_weights(pca, 2, plot=True)
Out[136]:
PersonalityType__Dreamer                                      0.306002
PersonalityType__Family                                       0.250518
PersonalityType__Social                                       0.247088
PersonalityType__Cultured                                     0.239722
MoneyType__Primary__5.0                                       0.136595
MoneyType__Minimalist                                         0.129889
Young Environmentalist                                        0.126968
Generation Movement                                           0.126968
PersonalityType__Religious                                    0.123383
Socioeconomic Status - HighRes__10.0                          0.103441
Socioeconomic Status - LowRes__5.0                            0.103441
Community: Size (bins)                                        0.088516
PostCode: Density of HHs per km^2 (bins)                      0.088498
PersonalityType__Materialist                                  0.086058
Customer Type                                                 0.069569
HH: Probability of Children in Residence                      0.062160
Bldg: Neighborhood Quality                                    0.054505
PLZ8: Bins of 10+ Family Homes                                0.051079
PLZ8: Bins of 6-10 Family Homes                               0.050781
Socioeconomic Status - LowRes__3.0                            0.049332
Life Stage - HighRes__20.0                                    0.049002
PLZ8: Primarily Business Bldgs                                0.048424
Shopper Type__1.0                                             0.048405
Life Stage - HighRes__13.0                                    0.046871
Energy Consumption Type__6.0                                  0.044059
Life Stage - LowRes__5.0                                      0.043259
Life Stage - HighRes__40.0                                    0.043007
Community: Share of Unemployment Relative to Parent County    0.041103
Life Stage - HighRes__35.0                                    0.040288
Family Type - LowRes__2.0                                     0.039203
Family Type - HighRes__2.0                                    0.039203
PLZ8: Bins of 3-5 Family Homes                                0.038695
Socioeconomic Status - HighRes__3.0                           0.037949
Life Stage - HighRes__9.0                                     0.037016
Socioeconomic Status - HighRes__6.0                           0.035968
MoneyType__Primary__3.0                                       0.035742
Socioeconomic Status - HighRes__7.0                           0.035443
Community: Share of Unemployment                              0.035003
Bldg: Building Type__3.0                                      0.034799
Life Stage - LowRes__11.0                                     0.032491
Life Stage - LowRes__12.0                                     0.031834
MoneyType__Preparer                                           0.029240
                                                                ...   
MoneyType__Primary__2.0                                      -0.027493
RR4: Life Stage Type - LowRes__4                             -0.028049
Birth Year                                                   -0.029289
MoneyType__Primary__6.0                                      -0.030378
Vacation Habits__9.0                                         -0.030400
Generation Decade                                            -0.033989
RR1: Neighborhood Type                                       -0.036974
Life Stage - HighRes__23.0                                   -0.037022
Socioeconomic Status - HighRes__9.0                          -0.037154
MoneyType__Saver                                             -0.037525
Socioeconomic Status - LowRes__4.0                           -0.038101
Family Type - HighRes__3.0                                   -0.038456
Consumption Channel Type__2.0                                -0.039380
Shopper Type__3.0                                            -0.041616
PLZ8: Bins of 1-2 Family Homes                               -0.042130
RR1: Residential-Commercial Activity Ratio (categories)      -0.045312
Family Type - HighRes__5.0                                   -0.049626
Life Stage - HighRes__22.0                                   -0.049698
Socioeconomic Status - LowRes__2.0                           -0.052842
Energy Consumption Type__3.0                                 -0.053166
Bldg: Distance to Point of Sale Category                     -0.057128
RR1: Purchasing Power (bins)                                 -0.059515
Energy Consumption Type__4.0                                 -0.059832
Family Type - HighRes__4.0                                   -0.061100
Life Stage - HighRes__21.0                                   -0.062212
PostCode: Distance to Nearest Urban Center (bins)            -0.069254
Socioeconomic Status - HighRes__2.0                          -0.070305
Bldg: Rural Type__1.0                                        -0.073811
PostCode: Distance to City Center (bins)                     -0.074118
HH: Net Income Bins                                          -0.079966
MoneyType__Homeowner                                         -0.081791
Family Type - LowRes__3.0                                    -0.088158
Life Stage - LowRes__6.0                                     -0.088158
Shopper Type__2.0                                            -0.100340
Socioeconomic Status - HighRes__4.0                          -0.101537
PersonalityType__Rational                                    -0.133972
MoneyType__Investor                                          -0.149844
PersonalityType__Event                                       -0.199470
PersonalityType__Critical                                    -0.247491
PersonalityType__Dominant                                    -0.253731
PersonalityType__Combative                                   -0.290392
Gender                                                       -0.323229
Name: Principal Component 2 Weights, Length: 253, dtype: float64
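The `get_pca_feature_weights` helper called in the cells above isn't shown in this excerpt. A minimal sketch consistent with its output (a named, descending-sorted weights Series with an optional plot) might look like the following; the exact signature and the `feature_names` parameter are assumptions, not the notebook's actual code.

```python
import pandas as pd

def get_pca_feature_weights(pca, component, feature_names=None, plot=False):
    """Return the weights of one principal component, sorted descending.

    pca           : a fitted sklearn PCA instance (uses pca.components_)
    component     : 0-based index of the component to inspect
    feature_names : readable names for the original features (optional)
    plot          : if True, draw a bar chart of the sorted weights
    """
    weights = pca.components_[component]
    if feature_names is None:
        feature_names = [f"Feature {i}" for i in range(len(weights))]
    series = pd.Series(weights, index=feature_names,
                       name=f"Principal Component {component} Weights")
    series = series.sort_values(ascending=False)
    if plot:
        import matplotlib.pyplot as plt
        series.plot(kind='bar', figsize=(14, 4))
        plt.ylabel("Weight")
        plt.tight_layout()
    return series
```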

Discussion 2.3: Interpret Principal Components

A few things I've noticed in these results:

  1. First of all, just to orient the conversation: a large positive weight means that anyone with a large value in that feature would be expected to have a large value for this Principal Component (PC), while anyone with a large value in a feature with a large negative weight would have a small value for this PC. Features with near-zero weights have essentially no impact on this PC, so they are effectively ignored by it.
  2. It's a bit surprising, but each principal component (PC) seems to be reasonably symmetric in its sorted weights. In other words, the most positive weight (e.g. 0.306 for PC2) is nearly the same magnitude as the most negative weight (e.g. -0.323 for PC2), and we see reasonably symmetric distributions of the weights for each of the first three PCs. At first I thought this might be a mathematical artifact of the covariance matrix and eigenvalue calculations done to generate the PCs in PCA, but I [found at least one example](https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf) in which this wasn't actually the case (the most negative and most positive weights were still roughly similar in magnitude, but the distribution of the first PC in that case was not at all symmetric about zero). This may still be an artifact of our large number of PCs and original features, but I'm not so sure.
    1. I interpret this as meaning that there is a substantial amount of redundant information in the original features, causing a recurring negative correlation between the highest-magnitude features in each PC and leading to fairly symmetric distributions of feature weights on either side of zero, but also lots of uncorrelated, low-importance features in each PC, given the large number of near-zero-weight features in each PC.
  3. PC0 seems to be community-wealth- and community-housing-focused, given that the top 4 weights are all indicators of monetary resources/housing that are not at the person-level (PLZ8: Bins of 6-10 Family Homes, HH: Net Income Bins, RR4: Life Stage Type - Int'l - Wealth, and PLZ8: Bins of 10+ Family Homes). Additionally, the 5th and 6th features by weight are also community-level.
    1. The most negative weights are less wealth-focused and more housing-focused, but still seem to be in the same vein as the most positive weights, which makes sense given that their magnitudes are similar. The top 5 negative-weight features are MoneyType__Minimalist, which is admittedly money-related (although at the person level); RR1: Movement Patterns, which likely relates to the amount of residential and workplace mobility in that region; RR3: Bins of 1-2 Family Homes; PLZ8: Bins of 1-2 Family Homes; and RR3: Bin Counts of Bldgs. This suggests to me that the size and building-oriented nature of communities may be slightly more important than community wealth measures in this PC.
  4. For PC1, the highest positive weights belong to Age Bin, MoneyType__Preparer, and Energy Consumption Type__3.0, which makes this one tricky to interpret, as a preparer mentality for finances, a "fair supplied" attitude towards energy consumption (whatever that means), and a person's age don't seem to have any obvious overlap in meaning. If we continue down the list of weights, we find PersonalityType__Event, Life Stage - LowRes__2.0, and Customer Type as the next highest positive weights. There are some indications that age may be relevant in this PC (given the presence of Age Bin and the fact that Life Stage - LowRes__2.0 includes age profiles among other things), but consumer behavior is likely more important to this PC, as MoneyType__Preparer, Energy Consumption Type__3.0, PersonalityType__Event, and Customer Type each measure in some way general or product-specific consumption patterns. This is a pretty broad interpretation, so let's see if we can narrow it down by looking at the large negative weights too.
    1. For the high-magnitude negative-weight features, we find that personality types and money types dominate the feature set at this end of the PC1 weights, but the most highly negatively weighted features both correspond to age. I found this perplexing at first, since it seemed to directly conflict with the fact that Age Bin had a high positive weight. But then I took another look at the data dictionary and my own engineered codes for Generation Decade and realized that these weights were consistent with the concept of giving older individuals a higher value in this PC (e.g. for Age Bin, higher values indicate older individuals, whereas for Birth Year and Generation Decade, higher values indicate younger individuals). Additionally, Life Stage - LowRes__2.0 (a high positive weight feature) corresponds to "single low-income and average earners of higher age".
    2. What we've learned from the high positive and high negative values here is that this PC seems to correspond to a very specific type of older individual: one who is an events-oriented spender, whose spending is hard to externally influence, and who likes to be financially prepared for any circumstance. That's a pretty specific profile without a more general trend behind it, so as a human-understandable heuristic I think the best short description of this PC is simply "older-age individuals".
  5. The third principal component, PC2, is a little easier to interpret when it comes to the weight distribution, as it seems that the top 4-5 most positive and most negative values are significantly greater in magnitude than the next highest weights, giving me a clear subset of the original features to focus on in my interpretation.
    1. For the large positive weights, we see PersonalityType__Dreamer, PersonalityType__Family, PersonalityType__Social, and PersonalityType__Cultured as the most influential features. As low values for these features represent high affinities for these personality types, it seems that this PC actually penalizes individuals that have strong affinities to these personality types. Since this is less informative than it could be, let's take a look at the large negative values before trying to even begin an interpretation.
    2. For the large negative weights, we have Gender, PersonalityType__Event, PersonalityType__Critical, PersonalityType__Dominant, and PersonalityType__Combative. Again, as these features all have low scores for high affinities, this suggests that these personality types are actually dominant features in this PC and, since Gender has a value of 1 for men and 2 for women, it suggests that a strong preference in the PC is given to men.
    3. Given that the top-most features on either end of the weight spectrum are personality types, it's pretty clear that this PC is concerned most with individual personality. In particular, it seems to represent individuals that are aggressive male type A personalities and undervalues more passive and collaborative females. I'm tempted to provide a caveat that male vs. female isn't as black and white as this statement suggests (because that seems too simplistic and an easy way to put bias into my analysis), but Gender is the largest of the weight magnitudes in this PC. As such, it's probably not crazy to say that the short description of this PC is gender-based (with stereotypical personality types for the two genders).

Step 3: Clustering

Step 3.1: Apply Clustering to General Population

You've assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, you will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.

  • Use sklearn's KMeans class to perform k-means clustering on the PCA-transformed data.
  • Then, compute the average distance from each point to its assigned cluster's center. Hint: The KMeans object's .score() method might be useful here, but note that in sklearn, scores tend to be defined so that larger is better. Try applying it to a small, toy dataset, or use an internet search to help your understanding.
  • Perform the above two steps for a number of different cluster counts. You can then see how the average distance decreases with an increasing number of clusters. However, each additional cluster provides a smaller net benefit. Use this fact to select a final number of clusters in which to group the data. Warning: because of the large size of the dataset, it can take a long time for the algorithm to resolve. The more clusters to fit, the longer the algorithm will take. You should test for cluster counts through at least 10 clusters to get the full picture, but you shouldn't need to test for a number of clusters above about 30.
  • Once you've selected a final number of clusters to use, re-fit a KMeans instance to perform the clustering operation. Make sure that you also obtain the cluster assignments for the general demographics data, since you'll be using them in the final Step 3.3.
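As the hint above suggests, trying `KMeans.score()` on a toy dataset makes the sign convention concrete: it returns the *negative* of the within-cluster sum of squared distances (the fitted model's `inertia_`), so larger (closer to zero) is better. A small sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight, well-separated toy clusters
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [10.0, 10.0], [10.1, 10.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# .score() is the negative within-cluster sum of squared distances,
# so on the training data it equals -km.inertia_ (larger is better)
score = km.score(X)

# Dividing by the number of points gives an average per-point value,
# the kind of quantity tracked over different cluster counts
avg_score = score / X.shape[0]
```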
In [102]:
# Over a number of different cluster counts...
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Max allowable cluster count
limit = 30

# Fraction of all samples randomly selected for calculating the silhouette score
# (necessary because silhouette is O(n^2))
sil_sample_pct = 0.01

ks = []
scores = []
sil_scores = []

# Progress bar to see where we're at
from tqdm import tqdm

for k in tqdm(range(2,limit)):
    # Try to speed this up a bit by using all available cores
    clusterer = KMeans(n_clusters=k, n_jobs=-1)

    cluster_labels = clusterer.fit_predict(gen_pop_scaled_pca)
    score_avg = clusterer.score(gen_pop_scaled_pca)
    
    # Silhouette Score accounts for both within-cluster variance
    # AND inter-cluster distance, providing higher scores
    # for clusters that are both tight and well-separated
    # (instead of just tight)
    silhouette_avg = silhouette_score(gen_pop_scaled_pca,
                                      cluster_labels,
                                      sample_size=\
                                      int(gen_pop_scaled_pca.shape[0] * sil_sample_pct))

    ks.append(k)
    scores.append(score_avg)
    sil_scores.append(silhouette_avg)
    
    print(f"For {k} clusters, the within-cluster score "
          f"is {score_avg} and the silhouette score is {silhouette_avg}")

k_df = pd.DataFrame({'k': ks, 
                     'Within-Cluster Score': scores,
                     'Silhouette Score': sil_scores
                    })

# Plot the scores at different values of k
fig, (ax_score, ax_sil) = plt.subplots(nrows=2)

sns.lineplot('k', 'Within-Cluster Score', data = k_df,
            ax=ax_score)
ax_score.set(xlabel="Number of Clusters")

sns.lineplot('k', 'Silhouette Score', data = k_df,
            ax=ax_sil)
ax_sil.set(xlabel="Number of Clusters")

# Finds the value of k corresponding to highest within-cluster score
k_opt = int(k_df.loc[k_df['Within-Cluster Score'].idxmax()]['k'])

print(f"The optimal number of clusters using the \
within-cluster sum of squares score is {k_opt}")

# Finds the value of k corresponding to highest silhouette score
k_opt = int(k_df.loc[k_df['Silhouette Score'].idxmax()]['k'])
    
print(f"The optimal number of clusters using the \
silhouette score is {k_opt}")
  4%|▎         | 1/28 [00:58<26:11, 58.22s/it]
For 2 clusters, the within-cluster score     is -131734829.01116344 and the silhouette score is 0.061538850499472865
  7%|▋         | 2/28 [02:14<27:32, 63.56s/it]
For 3 clusters, the within-cluster score     is -127162518.95198043 and the silhouette score is 0.03984441676360387
 11%|█         | 3/28 [03:41<29:24, 70.59s/it]
For 4 clusters, the within-cluster score     is -124467039.9548405 and the silhouette score is 0.04983192297904485
 14%|█▍        | 4/28 [06:02<36:45, 91.90s/it]
For 5 clusters, the within-cluster score     is -122143923.5885773 and the silhouette score is 0.04293286195968894
 18%|█▊        | 5/28 [08:01<38:19, 99.97s/it]
For 6 clusters, the within-cluster score     is -119877915.5862476 and the silhouette score is 0.045438640327064087
 21%|██▏       | 6/28 [09:45<37:05, 101.18s/it]
For 7 clusters, the within-cluster score     is -118328230.30049567 and the silhouette score is 0.039435797066931386
 25%|██▌       | 7/28 [11:46<37:25, 106.93s/it]
For 8 clusters, the within-cluster score     is -116866391.26873097 and the silhouette score is 0.0425718577749523
 29%|██▊       | 8/28 [13:27<35:06, 105.34s/it]
For 9 clusters, the within-cluster score     is -115232550.74058668 and the silhouette score is 0.04523020489138666
 32%|███▏      | 9/28 [15:27<34:46, 109.79s/it]
For 10 clusters, the within-cluster score     is -114658710.57100543 and the silhouette score is 0.0416323721843135
 36%|███▌      | 10/28 [17:27<33:47, 112.66s/it]
For 11 clusters, the within-cluster score     is -112731330.34791721 and the silhouette score is 0.04248192559038318
 39%|███▉      | 11/28 [19:29<32:45, 115.59s/it]
For 12 clusters, the within-cluster score     is -110838706.49668743 and the silhouette score is 0.047088705528848904
 43%|████▎     | 12/28 [21:47<32:36, 122.26s/it]
For 13 clusters, the within-cluster score     is -110679550.68332115 and the silhouette score is 0.051344571649174234
 46%|████▋     | 13/28 [24:16<32:35, 130.40s/it]
For 14 clusters, the within-cluster score     is -109533009.95891854 and the silhouette score is 0.04925824593963103
 50%|█████     | 14/28 [26:43<31:33, 135.22s/it]
For 15 clusters, the within-cluster score     is -107426997.77903548 and the silhouette score is 0.056326411502293915
 54%|█████▎    | 15/28 [29:09<30:01, 138.61s/it]
For 16 clusters, the within-cluster score     is -106850460.625319 and the silhouette score is 0.05077913523041286
 57%|█████▋    | 16/28 [31:44<28:39, 143.32s/it]
For 17 clusters, the within-cluster score     is -104317924.1066365 and the silhouette score is 0.06302921111159092
 61%|██████    | 17/28 [34:56<28:59, 158.12s/it]
For 18 clusters, the within-cluster score     is -103717532.80744897 and the silhouette score is 0.05959185727589558
 64%|██████▍   | 18/28 [37:42<26:45, 160.55s/it]
For 19 clusters, the within-cluster score     is -103618020.09518385 and the silhouette score is 0.05483907852901187
 68%|██████▊   | 19/28 [40:33<24:32, 163.57s/it]
For 20 clusters, the within-cluster score     is -102240759.85826388 and the silhouette score is 0.05771796477795186
 71%|███████▏  | 20/28 [43:33<22:28, 168.52s/it]
For 21 clusters, the within-cluster score     is -100931841.03956646 and the silhouette score is 0.06718512298681723
 75%|███████▌  | 21/28 [46:55<20:50, 178.63s/it]
For 22 clusters, the within-cluster score     is -100187389.55806403 and the silhouette score is 0.06755175727851537
 79%|███████▊  | 22/28 [49:58<17:58, 179.70s/it]
For 23 clusters, the within-cluster score     is -98675049.53026673 and the silhouette score is 0.0749094696142344
 82%|████████▏ | 23/28 [53:16<15:26, 185.21s/it]
For 24 clusters, the within-cluster score     is -98808443.01895985 and the silhouette score is 0.06868918563726784
 86%|████████▌ | 24/28 [56:47<12:51, 192.91s/it]
For 25 clusters, the within-cluster score     is -97853767.37715779 and the silhouette score is 0.07582765099308388
 89%|████████▉ | 25/28 [1:00:38<10:13, 204.39s/it]
For 26 clusters, the within-cluster score     is -96525046.44782087 and the silhouette score is 0.07282888429932437
 93%|█████████▎| 26/28 [1:04:42<07:12, 216.29s/it]
For 27 clusters, the within-cluster score     is -96000271.14516047 and the silhouette score is 0.07851217459710863
 96%|█████████▋| 27/28 [1:08:51<03:46, 226.06s/it]
For 28 clusters, the within-cluster score     is -95642290.63941312 and the silhouette score is 0.07588703480384429
100%|██████████| 28/28 [1:12:31<00:00, 224.21s/it]
For 29 clusters, the within-cluster score     is -93902756.83022006 and the silhouette score is 0.08715560202170007

The optimal number of clusters using the within-cluster sum of squares score is 29
The optimal number of clusters using the silhouette score is 29

Note the choppy nature of the silhouette score. This is an unfortunate consequence of randomly sampling the full dataset in order to give the silhouette calculation a manageable subset of observations.
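One way to tame that sampling noise (an alternative sketch, not what the cell above actually does) is to average the sampled silhouette score over several random draws:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def sampled_silhouette(X, labels, sample_size, n_repeats=5, seed=0):
    """Average silhouette_score over several random subsamples to
    smooth out the run-to-run noise of a single small sample."""
    scores = [
        silhouette_score(X, labels, sample_size=sample_size,
                         random_state=seed + i)
        for i in range(n_repeats)
    ]
    return float(np.mean(scores)), float(np.std(scores))
```

`n_repeats` trades runtime for stability; with 1% samples like those used above, even a handful of repeats should noticeably smooth the curve.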

The elbow method is a pretty subjective evaluative technique as it is typically used, so let's see if we can be a bit more numeric about it. Effectively, it is a matter of derivatives. The first derivative is the slope of these curves, $\frac{dScore}{dk}$. To find the elbow in the score vs. k curve, where we stop gaining as much information per added cluster, we can take the first derivative and look for where it transitions from a high-slope region to a low-slope region.

So, let's look at the first derivative (or a simple approximation of it), shall we?

In [103]:
# Approximating the derivative as delta-score/delta-k
# with delta-k = 1
k_df[['Within-Cluster Derivative / dk', 
      'Silhouette Derivative / dk']] = \
k_df.diff(periods=1)[['Within-Cluster Score',
            'Silhouette Score']]


fig, (ax_score, ax_sil) = plt.subplots(nrows=2, sharex=True)

sns.lineplot('k', 'Within-Cluster Derivative / dk', data = k_df,
            ax=ax_score)
ax_score.set(xlabel="Number of Clusters")
ax_score.axhline(color='red')

sns.lineplot('k', 'Silhouette Derivative / dk', data = k_df,
            ax=ax_sil)
ax_sil.axhline(color='red')

plt.tight_layout()

Looks like $k<10$ is probably the region of interest. Let's zoom in.

In [104]:
# Zoom-in
x_zoom = (0,10)

fig, ax_score = plt.subplots()

sns.lineplot('k', 'Within-Cluster Derivative / dk', data = k_df,
            ax=ax_score)
ax_score.set(xlabel="Number of Clusters", xlim=x_zoom)
ax_score.axhline(color='red')

plt.tight_layout()

It looks like our optimal value of k that minimizes the cluster count while maximizing the information we gain per cluster is $k=6$. It's the point at which we see the slope of the within-cluster score starting to flatten out, indicating that we are no longer accelerating in the amount of within-cluster variance being explained by increasing counts of clusters. We seem to have found our elbow!

In [105]:
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.

clusterer = KMeans(n_clusters=6, n_jobs=-1)
cluster_labels = clusterer.fit_predict(gen_pop_scaled_pca)
In [106]:
sns.distplot(cluster_labels)
Out[106]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1c39f390>

Discussion 3.1: Apply Clustering to General Population

I took the derivative of each score with respect to k in order to identify the k value at which additional clusters stopped explaining within-cluster variance at an accelerating rate. I defined this as the point at which the slope of the within-cluster score curve flattens out, indicating a constant rate of variance explained per added cluster, preceded by an accelerating rate. Based on this numeric interpretation of the Elbow Method, a k value of 6 seems most reasonable.

Step 3.2: Apply All Steps to the Customer Data

Now that you have clusters and cluster centers for the general population, it's time to see how the customer data maps on to those clusters. Take care to not confuse this for re-fitting all of the models to the customer data. Instead, you're going to use the fits from the general population to clean, transform, and cluster the customer data. In the last step of the project, you will interpret how the general population fits apply to the customer data.

  • Don't forget when loading in the customers data, that it is semicolon (;) delimited.
  • Apply the same feature wrangling, selection, and engineering steps to the customer demographics using the clean_data() function you created earlier. (You can assume that the customer demographics data has similar meaning behind missing data patterns as the general demographics data.)
  • Use the sklearn objects from the general demographics data, and apply their transformations to the customers data. That is, you should not be using a .fit() or .fit_transform() method to re-fit the old objects, nor should you be creating new sklearn objects! Carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.
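The fit-once / transform-many pattern described above can be sketched with toy arrays standing in for the cleaned datasets (the variable names here are illustrative, not the notebook's actual objects):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
general_toy = rng.normal(size=(200, 5))    # stands in for the cleaned general population
customers_toy = rng.normal(size=(50, 5))   # stands in for the cleaned customer data

# Fit every object ONLY on the general-population data...
scaler = StandardScaler().fit(general_toy)
pca = PCA(n_components=3).fit(scaler.transform(general_toy))
clusterer = KMeans(n_clusters=4, n_init=10, random_state=0)
clusterer.fit(pca.transform(scaler.transform(general_toy)))

# ...then only transform/predict (never fit/fit_transform) on the customers
cust_scaled = scaler.transform(customers_toy)
cust_pca = pca.transform(cust_scaled)
cust_labels = clusterer.predict(cust_pca)
```

Using `transform`/`predict` here guarantees the customers are projected into exactly the same scaled, PCA-rotated, cluster-labeled space as the general population, which is what makes the Step 3.3 comparison meaningful.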
In [107]:
# Load in the customer demographics data and re-import feature summary data for safety's sake.

customers = pd.read_csv('data/Udacity_CUSTOMERS_Subset.csv', sep=';')
feat_info = pd.read_csv('data/AZDIAS_Feature_Summary.csv', sep=';')

customers.head()
Out[107]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GEBURTSJAHR GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ TITEL_KZ VERS_TYP ZABEOTYP ALTER_HH ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE KK_KUNDENTYP W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_BAUMAX KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2 4 1 5.0 5 1 5 1 2 2 2 0 4.0 1 1 20.0 5.0 2.0 2.0 10.0 5.0 1 4 5.0 6 5 2 6 6 7 3 4 1 3 1 1 2 1 3 0.0 0.0 1 3 10.0 2.0 0.0 1.0 NaN 6.0 9.0 1.0 0.0 1.0 5.0 1992.0 W 7.0 1 1A 13 2.0 2.0 0.0 0.0 0.0 4.0 3.0 2.0 4.0 4.0 1.0 4.0 3.0 1.0 1201.0 3.0 3.0 1.0 0.0 1.0 5.0 5.0 1.0 2.0 1.0
1 -1 4 1 NaN 5 1 5 1 3 2 2 0 NaN 0 1 NaN NaN NaN NaN NaN NaN 1 0 NaN 3 6 2 6 7 5 3 4 1 3 3 2 4 1 3 0.0 0.0 1 3 11.0 3.0 0.0 NaN NaN 0.0 9.0 NaN NaN NaN 5.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 -1 4 2 2.0 5 1 5 1 4 4 2 0 3.0 1 2 13.0 3.0 1.0 1.0 10.0 5.0 1 4 5.0 2 2 1 3 3 7 7 1 2 7 5 6 4 1 1 0.0 0.0 2 3 6.0 1.0 0.0 1.0 NaN 6.0 9.0 1.0 0.0 8.0 1.0 1992.0 W 2.0 5 5D 34 2.0 2.0 0.0 0.0 0.0 3.0 7.0 4.0 1.0 3.0 3.0 3.0 1.0 7.0 433.0 2.0 3.0 3.0 1.0 3.0 3.0 2.0 3.0 5.0 3.0
3 1 4 1 2.0 5 1 5 2 1 2 6 0 10.0 0 2 0.0 0.0 0.0 0.0 9.0 4.0 1 1 3.0 6 5 3 4 7 5 3 4 3 3 3 3 3 4 0 0.0 0.0 1 1 8.0 0.0 0.0 4.0 NaN NaN 9.0 0.0 NaN 2.0 2.0 1992.0 W 7.0 4 4C 24 3.0 0.0 0.0 0.0 1.0 4.0 7.0 1.0 7.0 4.0 3.0 4.0 2.0 6.0 755.0 3.0 2.0 1.0 0.0 1.0 3.0 4.0 1.0 3.0 1.0
4 -1 3 1 6.0 3 1 4 4 5 2 2 1960 2.0 0 3 31.0 10.0 10.0 5.0 1.0 1.0 1 8 5.0 4 5 4 6 5 6 4 5 5 3 5 2 5 4 1 0.0 0.0 2 1 20.0 4.0 0.0 6.0 2.0 2.0 9.0 7.0 0.0 3.0 1.0 1992.0 W 3.0 7 7B 41 0.0 3.0 2.0 0.0 0.0 3.0 3.0 4.0 4.0 3.0 4.0 3.0 5.0 7.0 513.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 3.0 5.0 1.0
In [108]:
customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191652 entries, 0 to 191651
Data columns (total 85 columns):
AGER_TYP                 191652 non-null int64
ALTERSKATEGORIE_GROB     191652 non-null int64
ANREDE_KZ                191652 non-null int64
CJT_GESAMTTYP            188439 non-null float64
FINANZ_MINIMALIST        191652 non-null int64
FINANZ_SPARER            191652 non-null int64
FINANZ_VORSORGER         191652 non-null int64
FINANZ_ANLEGER           191652 non-null int64
FINANZ_UNAUFFAELLIGER    191652 non-null int64
FINANZ_HAUSBAUER         191652 non-null int64
FINANZTYP                191652 non-null int64
GEBURTSJAHR              191652 non-null int64
GFK_URLAUBERTYP          188439 non-null float64
GREEN_AVANTGARDE         191652 non-null int64
HEALTH_TYP               191652 non-null int64
LP_LEBENSPHASE_FEIN      188439 non-null float64
LP_LEBENSPHASE_GROB      188439 non-null float64
LP_FAMILIE_FEIN          188439 non-null float64
LP_FAMILIE_GROB          188439 non-null float64
LP_STATUS_FEIN           188439 non-null float64
LP_STATUS_GROB           188439 non-null float64
NATIONALITAET_KZ         191652 non-null int64
PRAEGENDE_JUGENDJAHRE    191652 non-null int64
RETOURTYP_BK_S           188439 non-null float64
SEMIO_SOZ                191652 non-null int64
SEMIO_FAM                191652 non-null int64
SEMIO_REL                191652 non-null int64
SEMIO_MAT                191652 non-null int64
SEMIO_VERT               191652 non-null int64
SEMIO_LUST               191652 non-null int64
SEMIO_ERL                191652 non-null int64
SEMIO_KULT               191652 non-null int64
SEMIO_RAT                191652 non-null int64
SEMIO_KRIT               191652 non-null int64
SEMIO_DOM                191652 non-null int64
SEMIO_KAEM               191652 non-null int64
SEMIO_PFLICHT            191652 non-null int64
SEMIO_TRADV              191652 non-null int64
SHOPPER_TYP              191652 non-null int64
SOHO_KZ                  145056 non-null float64
TITEL_KZ                 145056 non-null float64
VERS_TYP                 191652 non-null int64
ZABEOTYP                 191652 non-null int64
ALTER_HH                 145056 non-null float64
ANZ_PERSONEN             145056 non-null float64
ANZ_TITEL                145056 non-null float64
HH_EINKOMMEN_SCORE       188684 non-null float64
KK_KUNDENTYP             79715 non-null float64
W_KEIT_KIND_HH           137910 non-null float64
WOHNDAUER_2008           145056 non-null float64
ANZ_HAUSHALTE_AKTIV      141725 non-null float64
ANZ_HH_TITEL             139542 non-null float64
GEBAEUDETYP              141725 non-null float64
KONSUMNAEHE              145001 non-null float64
MIN_GEBAEUDEJAHR         141725 non-null float64
OST_WEST_KZ              141725 non-null object
WOHNLAGE                 141725 non-null float64
CAMEO_DEUG_2015          141224 non-null object
CAMEO_DEU_2015           141224 non-null object
CAMEO_INTL_2015          141224 non-null object
KBA05_ANTG1              135672 non-null float64
KBA05_ANTG2              135672 non-null float64
KBA05_ANTG3              135672 non-null float64
KBA05_ANTG4              135672 non-null float64
KBA05_BAUMAX             135672 non-null float64
KBA05_GBZ                135672 non-null float64
BALLRAUM                 141693 non-null float64
EWDICHTE                 141693 non-null float64
INNENSTADT               141693 non-null float64
GEBAEUDETYP_RASTER       141725 non-null float64
KKK                      137392 non-null float64
MOBI_REGIO               135672 non-null float64
ONLINE_AFFINITAET        188439 non-null float64
REGIOTYP                 137392 non-null float64
KBA13_ANZAHL_PKW         140371 non-null float64
PLZ8_ANTG1               138888 non-null float64
PLZ8_ANTG2               138888 non-null float64
PLZ8_ANTG3               138888 non-null float64
PLZ8_ANTG4               138888 non-null float64
PLZ8_BAUMAX              138888 non-null float64
PLZ8_HHZ                 138888 non-null float64
PLZ8_GBZ                 138888 non-null float64
ARBEIT                   141176 non-null float64
ORTSGR_KLS9              141176 non-null float64
RELAT_AB                 141176 non-null float64
dtypes: float64(49), int64(32), object(4)
memory usage: 124.3+ MB
In [109]:
# Clean the customer data

# Use an external script built on the previous version of clean_data(),
# extended to also perform imputation
from clean_data import enhanced_clean

customers = enhanced_clean(customers, feat_info, imp_median, imp_mode)
customers.info()
There are 0 null values in the series         Generation Designation prior to extraction
There are 0 null values in the series         Generation Designation after extraction
There are 0 null values in the series         Generation Designation prior to extraction
There are 0 null values in the series         Generation Designation after extraction
There are 0 null values in the series         Bldg: Neighborhood Quality prior to extraction
There are 0 null values in the series         Bldg: Neighborhood Quality after extraction
There are 0 null values in the series         Bldg: Neighborhood Quality prior to extraction
There are 0 null values in the series         Bldg: Neighborhood Quality after extraction
There are 0 null values in the series         PLZ8: Most Common Bldg Type prior to extraction
There are 0 null values in the series         PLZ8: Most Common Bldg Type after extraction
<class 'pandas.core.frame.DataFrame'>
Int64Index: 115737 entries, 0 to 191651
Columns: 252 entries, Age Bin to PLZ8: Primarily Business Bldgs
dtypes: float64(36), int64(28), uint8(188)
memory usage: 78.1 MB

That's odd: this has one fewer feature than the gen_pop dataset. Let's see what's going on here...

In [110]:
# What column is in gen_pop that isn't in customers?
gen_pop.columns[~gen_pop.columns.isin(customers.columns)]
Out[110]:
Index(['Bldg: Building Type__5.0'], dtype='object')

Looks like the Bldg: Building Type feature never takes the value 5 in the customers dataset, so no dummy column was generated for it. We can fix this by creating the corresponding dummy variable/feature ourselves and giving it a constant value of 0, since no customer record has that value.

In [111]:
customers['Bldg: Building Type__5.0'] = 0

# Do we still have a column mismatch?
gen_pop.columns[~gen_pop.columns.isin(customers.columns)]
Out[111]:
Index([], dtype='object')

Fantastic, that did the trick!
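As an aside, a more general way to guard against this kind of dummy-column mismatch is to reindex one DataFrame against the other's columns, which creates any missing columns and fills them with 0 in one step. A minimal sketch with toy frames (`ref` and `other` here are stand-ins for gen_pop and customers, not the real data):

```python
import pandas as pd

# Toy dummy-encoded frames: 'other' never observed the A == 5 level,
# so get_dummies would not have created an 'A__5' column for it
ref = pd.DataFrame({'A__1': [1, 0], 'A__2': [0, 1], 'A__5': [0, 0]})
other = pd.DataFrame({'A__1': [1, 1], 'A__2': [0, 0]})

# reindex creates every column present in ref but absent from other,
# filling it with a constant 0, and matches ref's column order
other_aligned = other.reindex(columns=ref.columns, fill_value=0)
```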

In [112]:
# Let's see if the assumption of missing values being equivalent across
# datasets is valid...
missing = pd.DataFrame(customers.isnull().sum()).rename(columns = {0: 'total missing'})
missing['percent missing'] = round(missing['total missing'] / len(customers),2)
missing.sort_values('total missing', ascending = False, inplace = True)
missing
Out[112]:
total missing percent missing
RR3: Bin Counts of Bldgs 1 0.0
Age Bin 0 0.0
RR4: Life Stage Type - HighRes__9E 0 0.0
RR4: Life Stage Type - HighRes__4B 0 0.0
RR4: Life Stage Type - HighRes__4C 0 0.0
RR4: Life Stage Type - HighRes__4D 0 0.0
RR4: Life Stage Type - HighRes__4E 0 0.0
RR4: Life Stage Type - HighRes__5A 0 0.0
RR4: Life Stage Type - HighRes__5B 0 0.0
RR4: Life Stage Type - HighRes__5C 0 0.0
RR4: Life Stage Type - HighRes__5D 0 0.0
RR4: Life Stage Type - HighRes__5E 0 0.0
RR4: Life Stage Type - HighRes__5F 0 0.0
RR4: Life Stage Type - HighRes__6A 0 0.0
RR4: Life Stage Type - HighRes__6B 0 0.0
RR4: Life Stage Type - HighRes__6C 0 0.0
RR4: Life Stage Type - HighRes__6D 0 0.0
RR4: Life Stage Type - HighRes__6E 0 0.0
RR4: Life Stage Type - HighRes__6F 0 0.0
RR4: Life Stage Type - HighRes__7A 0 0.0
RR4: Life Stage Type - HighRes__7B 0 0.0
RR4: Life Stage Type - HighRes__7C 0 0.0
RR4: Life Stage Type - HighRes__7D 0 0.0
RR4: Life Stage Type - HighRes__7E 0 0.0
RR4: Life Stage Type - HighRes__8A 0 0.0
RR4: Life Stage Type - HighRes__8B 0 0.0
RR4: Life Stage Type - HighRes__8C 0 0.0
RR4: Life Stage Type - HighRes__8D 0 0.0
RR4: Life Stage Type - HighRes__9A 0 0.0
RR4: Life Stage Type - HighRes__9B 0 0.0
RR4: Life Stage Type - HighRes__9C 0 0.0
RR4: Life Stage Type - HighRes__4A 0 0.0
RR4: Life Stage Type - HighRes__3D 0 0.0
RR4: Life Stage Type - HighRes__3C 0 0.0
Vacation Habits__8.0 0 0.0
Family Type - HighRes__6.0 0 0.0
Family Type - HighRes__7.0 0 0.0
Family Type - HighRes__8.0 0 0.0
Family Type - HighRes__9.0 0 0.0
Family Type - HighRes__10.0 0 0.0
Family Type - HighRes__11.0 0 0.0
Family Type - HighRes__nan 0 0.0
... ... ...
Socioeconomic Status - HighRes__2.0 0 0.0
Socioeconomic Status - HighRes__3.0 0 0.0
Socioeconomic Status - HighRes__4.0 0 0.0
Socioeconomic Status - HighRes__5.0 0 0.0
Socioeconomic Status - HighRes__6.0 0 0.0
Socioeconomic Status - HighRes__7.0 0 0.0
Socioeconomic Status - HighRes__8.0 0 0.0
Socioeconomic Status - HighRes__9.0 0 0.0
Socioeconomic Status - HighRes__10.0 0 0.0
Socioeconomic Status - HighRes__nan 0 0.0
Consumption Channel Type__4.0 0 0.0
Consumption Channel Type__3.0 0 0.0
Consumption Channel Type__2.0 0 0.0
Energy Consumption Type__nan 0 0.0
Nationality Based on Name__3.0 0 0.0
Nationality Based on Name__nan 0 0.0
Shopper Type__1.0 0 0.0
Shopper Type__2.0 0 0.0
Shopper Type__3.0 0 0.0
Shopper Type__nan 0 0.0
Socioeconomic Status - LowRes__2.0 0 0.0
Socioeconomic Status - LowRes__3.0 0 0.0
Socioeconomic Status - LowRes__4.0 0 0.0
Socioeconomic Status - LowRes__5.0 0 0.0
Socioeconomic Status - LowRes__nan 0 0.0
Family Type - LowRes__2.0 0 0.0
Family Type - LowRes__3.0 0 0.0
Family Type - LowRes__4.0 0 0.0
Family Type - LowRes__5.0 0 0.0
Family Type - LowRes__nan 0 0.0
MoneyType__Primary__2.0 0 0.0
MoneyType__Primary__3.0 0 0.0
MoneyType__Primary__4.0 0 0.0
MoneyType__Primary__5.0 0 0.0
MoneyType__Primary__6.0 0 0.0
MoneyType__Primary__nan 0 0.0
Energy Consumption Type__2.0 0 0.0
Energy Consumption Type__3.0 0 0.0
Energy Consumption Type__4.0 0 0.0
Energy Consumption Type__5.0 0 0.0
Energy Consumption Type__6.0 0 0.0
Bldg: Building Type__5.0 0 0.0

253 rows × 2 columns

In [113]:
print(f"There are {customers.isnull().sum().sum()} missing values remaining in the dataset.")
There are 1 missing values remaining in the dataset.

Looks like the customer data is almost, but not quite, the same as the general demographic data. We'll go ahead and remove the single record that still has a missing value; losing one row out of ~115,000 shouldn't be a big deal.

In [114]:
customers.dropna(inplace=True)

print(f"There are {customers.isnull().sum().sum()} missing values remaining in the dataset.")
There are 0 missing values remaining in the dataset.
In [115]:
# Apply feature scaling to the customer data

customers_scaled = scaler.transform(customers)
In [116]:
customers_scaled.shape
Out[116]:
(115736, 253)
In [117]:
# Do dimensionality reduction (PCA) on the customer data

customers_scaled_pca = pca.transform(customers_scaled)
customers_scaled_pca.shape
Out[117]:
(115736, 145)
In [118]:
# Label the clusters in the customer data

cluster_labels_customers = clusterer.predict(customers_scaled_pca)
In [119]:
sns.distplot(cluster_labels_customers)
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1c748390>
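Since cluster labels are discrete, a countplot (one bar per label) arguably reads more cleanly than a smoothed distplot. A minimal sketch with toy labels standing in for cluster_labels_customers:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headlessly
import seaborn as sns

# Toy discrete labels standing in for cluster_labels_customers
labels = [0, 0, 1, 1, 1, 2]

ax = sns.countplot(x=labels)  # one bar per unique label
ax.set(xlabel='Cluster Label', ylabel='Count')
```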

Step 3.3: Compare Customer Data to Demographics Data

At this point, you have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, you will compare the two cluster distributions to see where the strongest customer base for the company is.

Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If the company's customer base were universal, then the cluster assignment proportions should be fairly similar between the two. If only particular segments of the population are interested in the company's products, then we should see a mismatch from one to the other. If there is a higher proportion of persons in a cluster for the customer data compared to the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid), then that suggests the people in that cluster are a target audience for the company. On the other hand, if the proportion of data in a cluster is larger in the general population than in the customer data (e.g. only 2% of customers are closest to a population centroid that captures 6% of the data), that suggests that group of persons falls outside of the target demographics.

Take a look at the following points in this step:

  • Compute the proportion of data points in each cluster for the general population and the customer data. Visualizations will be useful here: both for the individual dataset proportions, but also to visualize the ratios in cluster representation between groups. Seaborn's countplot() or barplot() function could be handy.
    • Recall the analysis you performed in step 1.1.3 of the project, where you separated out certain data points from the dataset if they had more than a specified threshold of missing values. If you found that this group was qualitatively different from the main bulk of the data, you should treat this as an additional data cluster in this analysis. Make sure that you account for the number of data points in this subset, for both the general population and customer datasets, when making your computations!
  • Which cluster or clusters are overrepresented in the customer dataset compared to the general population? Select at least one such cluster and infer what kind of people might be represented by that cluster. Use the principal component interpretations from step 2.3 or look at additional components to help you make this inference. Alternatively, you can use the .inverse_transform() method of the PCA and StandardScaler objects to transform centroids back to the original data space and interpret the retrieved values directly.
  • Perform a similar investigation for the underrepresented clusters. Which cluster or clusters are underrepresented in the customer dataset compared to the general population, and what kinds of people are typified by these clusters?
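The proportion comparison described above can be sketched in a few lines: the ratio of customer share to general-population share directly flags over- and underrepresented clusters. The labels here are toy data, not the project's:

```python
import pandas as pd

# Toy cluster assignments standing in for the real label arrays
gen_labels = pd.Series([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
cust_labels = pd.Series([1, 1, 1, 1, 2, 0])

# Per-cluster share of each population
gen_prop = gen_labels.value_counts(normalize=True).sort_index()
cust_prop = cust_labels.value_counts(normalize=True).sort_index()

# Ratio > 1: overrepresented among customers; < 1: underrepresented
ratio = (cust_prop / gen_prop).fillna(0)
```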
In [120]:
# Put cluster labels into PCA dataframes 
# so the principal component interpretations can be leveraged

# Make scaled arrays DataFrames for exploration
    # Assuming here that the features are ordered 
    # from greatest (0) to least (144) explained variance
col_names = [f"PC{i}" for i in range(0,gen_pop_scaled_pca.shape[1])]
gen_pop_scaled_pca = pd.DataFrame(gen_pop_scaled_pca, columns=col_names)
customers_scaled_pca = pd.DataFrame(customers_scaled_pca, columns=col_names)

# Add in cluster labels
gen_pop_scaled_pca['Cluster Label'] = cluster_labels
customers_scaled_pca['Cluster Label'] = cluster_labels_customers

customers_scaled_pca.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115736 entries, 0 to 115735
Columns: 146 entries, PC0 to Cluster Label
dtypes: float64(145), int32(1)
memory usage: 128.5 MB
In [121]:
# Create DataFrames that retain deleted rows for analysis
# Note that, for simplicity's sake, this is just the raw data with no cleaning or imputation
gen_pop_raw = pd.read_csv('data/Udacity_AZDIAS_Subset.csv', sep=';')
customers_raw = pd.read_csv('data/Udacity_CUSTOMERS_Subset.csv', sep=';')

gen_pop_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Data columns (total 85 columns):
AGER_TYP                 891221 non-null int64
ALTERSKATEGORIE_GROB     891221 non-null int64
ANREDE_KZ                891221 non-null int64
CJT_GESAMTTYP            886367 non-null float64
FINANZ_MINIMALIST        891221 non-null int64
FINANZ_SPARER            891221 non-null int64
FINANZ_VORSORGER         891221 non-null int64
FINANZ_ANLEGER           891221 non-null int64
FINANZ_UNAUFFAELLIGER    891221 non-null int64
FINANZ_HAUSBAUER         891221 non-null int64
FINANZTYP                891221 non-null int64
GEBURTSJAHR              891221 non-null int64
GFK_URLAUBERTYP          886367 non-null float64
GREEN_AVANTGARDE         891221 non-null int64
HEALTH_TYP               891221 non-null int64
LP_LEBENSPHASE_FEIN      886367 non-null float64
LP_LEBENSPHASE_GROB      886367 non-null float64
LP_FAMILIE_FEIN          886367 non-null float64
LP_FAMILIE_GROB          886367 non-null float64
LP_STATUS_FEIN           886367 non-null float64
LP_STATUS_GROB           886367 non-null float64
NATIONALITAET_KZ         891221 non-null int64
PRAEGENDE_JUGENDJAHRE    891221 non-null int64
RETOURTYP_BK_S           886367 non-null float64
SEMIO_SOZ                891221 non-null int64
SEMIO_FAM                891221 non-null int64
SEMIO_REL                891221 non-null int64
SEMIO_MAT                891221 non-null int64
SEMIO_VERT               891221 non-null int64
SEMIO_LUST               891221 non-null int64
SEMIO_ERL                891221 non-null int64
SEMIO_KULT               891221 non-null int64
SEMIO_RAT                891221 non-null int64
SEMIO_KRIT               891221 non-null int64
SEMIO_DOM                891221 non-null int64
SEMIO_KAEM               891221 non-null int64
SEMIO_PFLICHT            891221 non-null int64
SEMIO_TRADV              891221 non-null int64
SHOPPER_TYP              891221 non-null int64
SOHO_KZ                  817722 non-null float64
TITEL_KZ                 817722 non-null float64
VERS_TYP                 891221 non-null int64
ZABEOTYP                 891221 non-null int64
ALTER_HH                 817722 non-null float64
ANZ_PERSONEN             817722 non-null float64
ANZ_TITEL                817722 non-null float64
HH_EINKOMMEN_SCORE       872873 non-null float64
KK_KUNDENTYP             306609 non-null float64
W_KEIT_KIND_HH           783619 non-null float64
WOHNDAUER_2008           817722 non-null float64
ANZ_HAUSHALTE_AKTIV      798073 non-null float64
ANZ_HH_TITEL             794213 non-null float64
GEBAEUDETYP              798073 non-null float64
KONSUMNAEHE              817252 non-null float64
MIN_GEBAEUDEJAHR         798073 non-null float64
OST_WEST_KZ              798073 non-null object
WOHNLAGE                 798073 non-null float64
CAMEO_DEUG_2015          792242 non-null object
CAMEO_DEU_2015           792242 non-null object
CAMEO_INTL_2015          792242 non-null object
KBA05_ANTG1              757897 non-null float64
KBA05_ANTG2              757897 non-null float64
KBA05_ANTG3              757897 non-null float64
KBA05_ANTG4              757897 non-null float64
KBA05_BAUMAX             757897 non-null float64
KBA05_GBZ                757897 non-null float64
BALLRAUM                 797481 non-null float64
EWDICHTE                 797481 non-null float64
INNENSTADT               797481 non-null float64
GEBAEUDETYP_RASTER       798066 non-null float64
KKK                      770025 non-null float64
MOBI_REGIO               757897 non-null float64
ONLINE_AFFINITAET        886367 non-null float64
REGIOTYP                 770025 non-null float64
KBA13_ANZAHL_PKW         785421 non-null float64
PLZ8_ANTG1               774706 non-null float64
PLZ8_ANTG2               774706 non-null float64
PLZ8_ANTG3               774706 non-null float64
PLZ8_ANTG4               774706 non-null float64
PLZ8_BAUMAX              774706 non-null float64
PLZ8_HHZ                 774706 non-null float64
PLZ8_GBZ                 774706 non-null float64
ARBEIT                   794005 non-null float64
ORTSGR_KLS9              794005 non-null float64
RELAT_AB                 794005 non-null float64
dtypes: float64(49), int64(32), object(4)
memory usage: 578.0+ MB
In [122]:
# What rows are missing in one but not the other?
dropped_rows_cust = customers_raw[~customers_raw.index.isin(customers.index)].index
retained_rows_cust = customers_raw[customers_raw.index.isin(customers.index)].index

dropped_rows_gen = gen_pop_raw[~gen_pop_raw.index.isin(gen_pop.index)].index
retained_rows_gen = gen_pop_raw[gen_pop_raw.index.isin(gen_pop.index)].index

# Set indices for newly-created PCA-derived DataFrames so they match original DataFrame indices
gen_pop_scaled_pca.index = retained_rows_gen
customers_scaled_pca.index = retained_rows_cust

# Set rest of labels accordingly
gen_pop_raw = gen_pop_raw.join(gen_pop_scaled_pca['Cluster Label'],                 
                 how='left')

customers_raw = customers_raw.join(customers_scaled_pca['Cluster Label'],
                                  how='left')

# Set cluster labels for rows that were deleted to 6 (max label + 1)
gen_pop_raw.loc[dropped_rows_gen, 'Cluster Label'] = 6
customers_raw.loc[dropped_rows_cust, 'Cluster Label'] = 6

# Check to make sure we haven't generated a bunch of null values as a result of JOINs
temp = gen_pop_raw['Cluster Label'].isnull().sum()
print(f"Gen_pop label nulls = {temp}")

temp = customers_raw['Cluster Label'].isnull().sum()
print(f"Customer label nulls = {temp}")
Gen_pop label nulls = 0
Customer label nulls = 0

There we go! The cluster labels are now merged into DataFrames I can explore.

In [123]:
# Compare the proportion of data in each cluster for the customer data to the
# proportion of data in each cluster for the general population.
    
fig, (ax_general, ax_customer) = plt.subplots(nrows=2, sharex=True)

sns.distplot(gen_pop_raw['Cluster Label'], ax=ax_general)
ax_general.set(title="General Population Clusters",
              xlabel="", ylabel='Count')

sns.distplot(customers_raw['Cluster Label'], ax=ax_customer)
ax_customer.set(title="Customer Clusters",
              xlabel='Cluster Label', ylabel='Count')

plt.tight_layout()

It's certainly clear from this simple comparison that the two populations (general population and customers) are not equivalent in terms of cluster membership. Let's dive a little deeper and directly compare the percentages of each population present in each cluster.

Note: cluster 6 corresponds to the "fake" cluster comprised of individuals that were dropped during preprocessing because they exceeded our missing-value threshold. Since their distributions differed from those of the retained records in every feature we spot-checked, we're treating them as a cluster unto themselves. Admittedly, this isn't a rigorous way of defining a cluster label, but it's better than ignoring them completely!

In [124]:
# What does the % of a population's membership in each cluster look like
# across both populations?

# Customer data
cluster_counts_cust = pd.DataFrame(customers_raw['Cluster Label'].value_counts()).reset_index()
cluster_counts_cust.rename(columns={'index': 'Cluster', 'Cluster Label': 'Count'}, inplace=True)
cluster_counts_cust.sort_values('Cluster', inplace=True)

# General population data
cluster_counts_gen = pd.DataFrame(gen_pop_raw['Cluster Label'].value_counts()).reset_index()
cluster_counts_gen.rename(columns={'index': 'Cluster', 'Cluster Label': 'Count'}, inplace=True)
cluster_counts_gen.sort_values('Cluster', inplace=True)
In [125]:
# Add in % values too

cluster_counts_gen['Percentage'] = cluster_counts_gen['Count'] / cluster_counts_gen['Count'].sum()
cluster_counts_cust['Percentage'] = cluster_counts_cust['Count'] / cluster_counts_cust['Count'].sum()
In [126]:
# What does the % of a population's membership in each cluster look like
# across both populations?
# Visualize it.

fig, ax = plt.subplots()

sns.lineplot(x='Cluster', y='Percentage', data=cluster_counts_gen,
             ax=ax, label='General Population')
sns.lineplot(x='Cluster', y='Percentage',
             data=cluster_counts_cust, ax=ax, label='Customers')

ax.legend()
Out[126]:
<matplotlib.legend.Legend at 0x1a1e564630>
  1. We can see that the fraction of people in clusters 0, 4, and 5 is fairly similar between the customer and general population groups. This suggests these clusters are not of particular interest/utility: the equivalence implies these groups aren't especially attracted to the company's products (which would lead to customers being overrepresented in the cluster) nor especially turned off by them (which would lead to an underrepresentation of customers relative to the general population).
  2. Cluster 6 shows an overrepresentation of customers relative to the general population, but sadly this doesn't tell us much, as this "cluster" is comprised of the individuals who had a lot of missing values and were dropped from our analysis. In the final section, I'll discuss the features where their distributions deviated from the retained records, in an effort to interpret this group at least partially.
  3. Cluster 1 sees a significant overrepresentation of customers relative to the general population membership of that cluster, indicating that this is a sweet spot for the company, comprised of people who for some reason have an affinity for the company's products.
  4. Clusters 2 and 3 show an underrepresentation of customers relative to the general population (with 0 being the greatest), indicating that people in these clusters are less likely or fairly unlikely to be interested in the company's products.

As we're most interested in why segments of the population are particularly attracted to or repelled by the company's products, we'll focus on the latter two groups for the remainder of this analysis.
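As the step 3.3 instructions note, another way to characterize a cluster is to push its centroid back through the PCA and scaler with .inverse_transform(), recovering approximate values in the original feature units. A self-contained sketch on synthetic data (the real scaler/pca/clusterer objects would be substituted in):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for the cleaned, numeric demographic matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5)) * [1, 2, 3, 4, 5] + [10, 0, -5, 3, 7]

scaler = StandardScaler().fit(X)
pca = PCA(n_components=3).fit(scaler.transform(X))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    pca.transform(scaler.transform(X)))

# Centroids: PCA space -> scaled feature space -> original feature units
centroids_scaled = pca.inverse_transform(km.cluster_centers_)
centroids_orig = scaler.inverse_transform(centroids_scaled)
```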

In [127]:
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
# Cluster 1

# Since these clusters were based off of PCA components and not the original untransformed
# features (except for cluster 6), I'll focus on the PCA components for the most part
    
# Let's first look at the first 2 principal components and how our clusters look in that space
    # as I've already provided interpretations of those components earlier

fig, (ax1_0, ax2_0, ax2_1) = plt.subplots(nrows=3, figsize = (6,10))
    
sns.scatterplot(data=customers_scaled_pca, x='PC0', y='PC1', hue='Cluster Label',
               ax=ax1_0, palette='rainbow_r', alpha = 0.5, legend=False)
ax1_0.set(xlim=(-10,10), ylim=(-10,10))

sns.scatterplot(data=customers_scaled_pca, x='PC0', y='PC2', hue='Cluster Label',
               ax=ax2_0, palette='rainbow_r', alpha = 0.5, legend='full')
ax2_0.set(xlim=(-10,10), ylim=(-10,10))

sns.scatterplot(data=customers_scaled_pca, x='PC1', y='PC2', hue='Cluster Label',
               ax=ax2_1, palette='rainbow_r', alpha = 0.5, legend=False)
ax2_1.set(xlim=(-10,10), ylim=(-10,10))

# Put the legend outside the figure
# Adapted from https://stackoverflow.com/questions/30490740/move-legend-outside-figure-in-seaborn-tsplot
ax2_0.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.tight_layout()

plt.savefig('data/PC0_1_2.png')
In [128]:
# Let's take a look at this in 3D

px.scatter_3d(customers_scaled_pca, x='PC0', y='PC1', z='PC2', color='Cluster Label',
              opacity=0.5)
In [129]:
# Use the cluster centroids for each label as a proxy
# for what the most important principal components are for each cluster

cluster_centers = pd.DataFrame(clusterer.cluster_centers_)
cluster_centers.index.name = 'Cluster Label'
cluster_centers.columns = col_names

cluster_centers
Out[129]:
PC0 PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32 PC33 PC34 PC35 PC36 PC37 PC38 PC39 PC40 PC41 ... PC103 PC104 PC105 PC106 PC107 PC108 PC109 PC110 PC111 PC112 PC113 PC114 PC115 PC116 PC117 PC118 PC119 PC120 PC121 PC122 PC123 PC124 PC125 PC126 PC127 PC128 PC129 PC130 PC131 PC132 PC133 PC134 PC135 PC136 PC137 PC138 PC139 PC140 PC141 PC142 PC143 PC144
Cluster Label
0 -3.851561 -0.757068 -0.756409 -2.372630 0.620601 0.790633 -0.604790 -0.621392 -0.449303 0.369154 -0.687914 0.320012 0.394234 0.221204 0.088495 -0.040033 -0.056276 -0.026296 -0.000787 0.063082 0.079050 0.015364 0.040323 0.143896 -0.115072 -0.138607 0.002170 -0.152108 -0.081366 -0.033072 0.155014 -0.093281 -0.002391 -0.074499 0.059600 0.036976 -0.001983 -0.045617 -0.012020 0.020575 -0.083008 0.003658 ... -0.000480 0.027717 0.004866 -0.017824 -0.024463 0.002484 -0.026980 -0.007185 -0.017729 0.011736 -0.013344 0.005333 0.000484 0.032301 0.031834 0.011664 -0.004670 -0.032702 -0.029208 0.032463 0.032662 0.038254 0.009809 0.008715 -0.039076 0.025871 0.019584 0.023925 0.008586 0.013447 0.006303 -0.004194 0.010936 0.023813 0.041386 0.064786 -0.014934 0.035675 -0.013426 0.018058 -0.018114 0.043020
1 -4.496158 -0.419895 1.794418 3.321905 -0.654068 -0.018229 -0.329048 -0.067503 -0.475221 -0.434267 0.873242 -0.198040 -0.532539 -0.203347 -0.096464 -0.151192 0.379614 0.010085 0.096164 0.021400 -0.221563 -0.258769 -0.042472 -0.058139 0.148973 0.034235 0.042256 0.153007 0.015596 0.059385 0.184250 -0.021018 0.004209 0.015674 -0.051076 0.121512 0.002233 -0.033391 -0.049326 -0.057139 -0.010964 -0.011272 ... -0.018218 0.024779 -0.040376 -0.014527 0.004065 0.002514 -0.021489 -0.038961 0.036050 0.023627 -0.000997 0.013155 0.014616 -0.019816 0.020668 0.050905 -0.020891 -0.035586 -0.015042 0.023751 0.012055 -0.018821 0.013501 -0.005505 0.017191 -0.044256 0.014825 0.024617 0.012003 -0.039693 -0.023432 0.002305 -0.034643 -0.007731 0.030573 -0.018751 -0.021113 0.017821 -0.010468 0.026928 0.004334 -0.036364
2 1.484750 3.878671 0.012371 -0.285962 0.255036 -0.079022 0.609196 0.249634 0.257010 0.035681 -0.140657 0.447593 -0.405809 -0.256713 0.111176 0.133801 -0.173897 -0.065637 -0.008262 -0.109773 0.096695 0.155234 0.022999 -0.076598 0.020477 0.218642 -0.000951 0.065694 -0.023888 0.049627 -0.232850 0.145843 0.098800 -0.059459 0.070443 0.041091 -0.078768 0.012217 0.018526 -0.010330 0.015248 0.023813 ... 0.014899 -0.037859 0.034147 0.010494 0.029025 0.025912 0.035870 -0.007434 0.002871 -0.051171 -0.010724 -0.007307 -0.007438 0.036531 -0.026740 -0.050118 0.037054 0.013649 0.008034 -0.023279 -0.025707 -0.070706 -0.073877 0.007191 -0.002779 -0.042189 0.018377 -0.040158 -0.021338 0.049555 0.024845 0.050409 -0.020970 -0.015240 -0.022167 -0.023502 0.038850 -0.020892 0.006381 -0.028122 0.004675 -0.003366
3 3.790514 -2.408770 -0.011681 -0.206175 -0.894444 -0.071999 0.533309 -0.375622 -0.593248 0.106264 0.145201 -0.379585 0.209287 0.066247 -0.021715 -0.269062 0.285488 0.193957 0.060287 0.172879 -0.425077 -0.222934 0.506709 0.290568 -0.329564 -0.284851 -0.466493 -0.028242 0.302192 -0.155642 -0.103607 -0.223496 -0.027700 0.017129 -0.022567 -0.255201 -0.013910 -0.024878 -0.073492 0.032146 0.110425 -0.042242 ... -0.011682 0.016053 -0.026953 0.018476 -0.020530 -0.021722 -0.015499 0.048718 -0.001038 0.028090 0.011115 0.002548 0.015031 -0.008111 -0.033531 -0.005615 -0.008212 0.018738 0.034633 -0.054869 -0.032786 0.061794 0.079334 -0.036789 -0.004115 0.046020 -0.029313 0.027022 0.021104 -0.009393 -0.018396 -0.027816 0.040563 -0.003072 0.015877 0.000487 -0.001878 -0.005092 0.017332 0.016234 -0.008964 0.005447
4 1.575887 -3.063538 -3.215365 0.717929 -2.040417 0.723413 -5.215408 5.692177 5.873514 0.092666 -0.077453 0.007348 0.200557 -0.034846 0.090705 0.058575 -0.084665 0.099191 -0.078800 -0.189515 0.064462 0.099597 -0.121116 -0.018600 -0.097992 0.020940 0.284662 0.009457 0.038212 0.048745 -0.056889 -0.145127 -0.075501 0.271095 0.119256 0.119437 0.140226 0.083122 0.149046 0.074350 -0.127801 0.029019 ... 0.005639 0.007329 -0.013985 -0.026215 0.016171 0.010127 0.016018 -0.080810 0.006074 0.023385 0.034214 0.012600 -0.002521 -0.022568 0.049089 0.013191 -0.074018 0.035577 -0.030849 0.033083 0.032416 0.011960 -0.015419 0.008497 0.061939 0.002331 -0.035693 -0.063972 -0.021758 -0.040908 0.037403 0.005552 0.060450 -0.023406 -0.044452 -0.006183 -0.028196 0.001933 -0.013703 -0.000432 0.006206 -0.007735
5 2.590656 -4.010530 -0.411693 0.845524 2.865359 -2.301909 0.919527 -0.725089 0.426479 -0.687201 0.155808 -1.234083 1.103176 0.775143 -0.527431 0.848173 -0.983194 -0.397055 -0.383999 -0.276757 1.379773 0.654836 -1.910291 -1.012723 1.151416 0.374411 1.432798 -0.106719 -0.786921 0.264484 0.510182 0.575593 -0.319983 0.267349 -0.373582 0.239184 0.340542 0.223299 0.266469 -0.029661 -0.104006 0.044512 ... 0.018853 -0.041001 0.035653 -0.006037 0.000308 -0.061482 0.024464 0.025549 -0.048216 0.023302 0.036618 -0.033655 -0.058194 -0.182110 0.062313 0.076695 -0.032375 0.047268 -0.012482 0.122844 0.083536 0.020673 -0.004155 0.081839 0.076914 0.057026 -0.061124 -0.018235 -0.024781 -0.114997 -0.029090 -0.128334 -0.029889 0.037244 -0.144401 -0.052049 -0.055011 -0.047816 -0.014477 -0.056704 0.055449 -0.046384

6 rows × 145 columns

In [130]:
def top_cluster_coords(df, cluster_label, top_k):
    '''
    Takes information about the cluster centroid coordinates for all cluster labels 
    and returns the top_k positive and negative coordinate values for a given cluster's
    centroid. These principal components can then be interpreted in human-understandable
    terms to figure out what a reasonable description of the membership of a cluster is.
    
    Inputs
    ------
    df: pandas DataFrame with cluster labels as the index values and 
        the centroid coordinate for each feature as columns
        
    cluster_label: int. Label of the cluster you want to investigate
    
    top_k: int. Number of most positive and most negative coordinates you want returned
    
    Returns
    -------
    pandas DataFrame with a single column for the coordinates of the top positive 
        and negative principal components (PCs)
    '''

    # What are the top k most positive and top k most negative PCs according to the centroid?
    top = pd.DataFrame(pd.concat([df.loc[cluster_label].sort_values(ascending=False)[:top_k],
                    df.loc[cluster_label].sort_values(ascending=False)[-top_k:]]))
    top.rename(columns={cluster_label: 'Top PCs'}, inplace=True)

    return top
    
In [131]:
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
# Cluster 1

top_PCs_c1 = top_cluster_coords(cluster_centers, 1, 3)
top_PCs_c1
Out[131]:
Top PCs
PC3 3.321905
PC2 1.794418
PC10 0.873242
PC12 -0.532539
PC4 -0.654068
PC0 -4.496158
In [132]:
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?

# Clusters 2 and 3
top_PCs_c2 = top_cluster_coords(cluster_centers, 2, 3)
top_PCs_c3 = top_cluster_coords(cluster_centers, 3, 3)
In [134]:
# Cluster 2
top_PCs_c2
Out[134]:
Top PCs
PC1 3.878671
PC0 1.484750
PC6 0.609196
PC13 -0.256713
PC3 -0.285962
PC12 -0.405809
In [135]:
# Cluster 3
top_PCs_c3
Out[135]:
Top PCs
PC0 3.790514
PC6 0.533309
PC22 0.506709
PC8 -0.593248
PC4 -0.894444
PC1 -2.408770
In [138]:
get_pca_feature_weights(pca, 22)
Out[138]:
Life Stage - LowRes__7.0                                   0.386765
Life Stage - HighRes__24.0                                 0.291123
Life Stage - HighRes__25.0                                 0.250563
RR4: Life Stage Type - LowRes__2                           0.204791
RR4: Life Stage Type - HighRes__2D                         0.137400
Life Stage - LowRes__11.0                                  0.106045
RR4: Life Stage Type - HighRes__2C                         0.099696
Life Stage - LowRes__12.0                                  0.092375
RR4: Life Stage Type - HighRes__2B                         0.090230
RR4: Life Stage Type - LowRes__9                           0.081828
Family Type - HighRes__6.0                                 0.074363
Shopper Type__2.0                                          0.070439
Life Stage - HighRes__35.0                                 0.065013
Health Type                                                0.064649
Life Stage - HighRes__34.0                                 0.063797
Life Stage - HighRes__39.0                                 0.055577
RR4: Life Stage Type - HighRes__9A                         0.054936
RR4: Life Stage Type - LowRes__5                           0.053487
Life Stage - HighRes__33.0                                 0.052079
Life Stage - HighRes__37.0                                 0.051391
RR4: Life Stage Type - HighRes__9B                         0.051124
RR4: Life Stage Type - HighRes__2A                         0.050495
Energy Consumption Type__6.0                               0.049688
Nationality Based on Name__2.0                             0.046293
Life Stage - LowRes__10.0                                  0.045265
MoneyType__Primary__6.0                                    0.042227
Life Stage - HighRes__11.0                                 0.040972
Life Stage - HighRes__36.0                                 0.038904
Family Type - HighRes__11.0                                0.038719
Customer Type                                              0.038031
Life Stage - HighRes__31.0                                 0.035580
RR4: Life Stage Type - HighRes__9C                         0.035157
Life Stage - HighRes__5.0                                  0.032509
RR4: Life Stage Type - HighRes__9D                         0.031233
RR4: Life Stage Type - HighRes__8A                         0.030893
RR4: Life Stage Type - HighRes__5A                         0.030738
RR1: Residential-Commercial Activity Ratio (categories)    0.030402
RR4: Life Stage Type - LowRes__8                           0.030326
RR4: Life Stage Type - HighRes__5C                         0.028636
Life Stage - HighRes__40.0                                 0.028546
RR3: Bins of 6-10 Family Homes                             0.027983
Vacation Habits__12.0                                      0.027907
                                                             ...   
RR4: Life Stage Type - LowRes__6                          -0.028412
Vacation Habits__3.0                                      -0.028554
RR4: Life Stage Type - Int'l - Stage                      -0.029883
RR4: Life Stage Type - HighRes__4E                        -0.030289
RR4: Life Stage Type - HighRes__3B                        -0.030370
RR4: Life Stage Type - HighRes__4B                        -0.030488
RR4: Life Stage Type - HighRes__9E                        -0.031198
Socioeconomic Status - HighRes__2.0                       -0.031749
Bldg: Number of HHs                                       -0.032467
Bldg: Number of Academic Title Holders                    -0.032702
RR4: Life Stage Type - Int'l - Wealth                     -0.033045
Life Stage - HighRes__8.0                                 -0.035977
RR4: Life Stage Type - HighRes__4D                        -0.036069
Bldg: Year of Building's Initial Database Entry           -0.036237
MoneyType__Inconspicuous                                  -0.037606
RR3: Bins of 10+ Family Homes                             -0.037828
PersonalityType__Sensual                                  -0.039686
Family Type - HighRes__9.0                                -0.040133
HH: Probability of Children in Residence                  -0.040534
RR4: Life Stage Type - HighRes__7E                        -0.042930
Bldg: Rural Type__2.0                                     -0.044429
RR4: Life Stage Type - LowRes__3                          -0.045584
Vacation Habits__2.0                                      -0.046451
RR4: Life Stage Type - HighRes__7D                        -0.048558
Energy Consumption Type__5.0                              -0.051505
Family Type - HighRes__8.0                                -0.057998
Shopper Type__3.0                                         -0.058026
Life Stage - HighRes__6.0                                 -0.060581
RR4: Life Stage Type - HighRes__7C                        -0.062571
RR4: Life Stage Type - HighRes__4A                        -0.064931
RR4: Life Stage Type - HighRes__4C                        -0.076403
RR4: Life Stage Type - HighRes__7A                        -0.083167
RR4: Life Stage Type - HighRes__7B                        -0.090842
RR4: Life Stage Type - LowRes__4                          -0.120267
Life Stage - HighRes__26.0                                -0.127980
Life Stage - HighRes__30.0                                -0.138471
RR4: Life Stage Type - LowRes__7                          -0.156020
Life Stage - HighRes__29.0                                -0.193085
Life Stage - HighRes__27.0                                -0.238480
Life Stage - LowRes__9.0                                  -0.240322
Life Stage - HighRes__28.0                                -0.251322
Life Stage - LowRes__8.0                                  -0.372827
Name: Principal Component 22 Weights, Length: 253, dtype: float64
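
The helper used above is defined earlier in the notebook; for reference, a function with this behavior can be sketched roughly as follows (the `_sketch` suffix, the toy data, and the `feature_names` argument are assumptions for illustration, not the notebook's actual implementation):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def get_pca_feature_weights_sketch(pca, component, feature_names):
    """Return the weights of one principal component, sorted descending,
    so the features that drive the component are easy to read off."""
    weights = pd.Series(pca.components_[component], index=feature_names,
                        name=f'Principal Component {component} Weights')
    return weights.sort_values(ascending=False)

# Toy demonstration on random data (stand-in for the demographics features)
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
pca = PCA(n_components=3).fit(X)
print(get_pca_feature_weights_sketch(pca, 0, ['f0', 'f1', 'f2', 'f3']))
```

Sorting the weights this way puts the features that push a component most strongly in each direction at the two ends of the output, which is exactly what the interpretation step needs.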

Further PC Interpretations to Aid in Cluster Interpretation:

  1. Community-wealth- and community-housing-focused
  2. A specific type of older individual: an events-oriented spender whose spending is hard to externally influence, and who likes to be financially prepared for any circumstance
  3. Aggressive male type A personalities
  4. Environmentally minded, high-income earners who live in cities and don't own homes
  5. Two-generation household, large number of adults in the home, more than just an adult couple
  6. High-income nuclear family, with older kids (teenage to adult)
  7. Middle-income nuclear family
  8. Single parent, single teenage or grown-up child
  9. Low-income single parent
  10. No interpretation
  11. Top young earner from a multi-person household, lives in a rental
  12. Young high-income homeowners from multiperson households
  13. Urban low-income working class, live in large multi-family building near commercial properties
  14. Retired middle-income elderly people living in sparsely populated low-income area in East Germany
  15. No interpretation
  16. No interpretation
  17. No interpretation
  18. Young high-income homeowner from multi-person household, likely flagrant spender
  19. No interpretation
  20. No interpretation
  21. No interpretation
  22. No interpretation
  23. Single, low- to moderate-income renters

Discussion 3.3: Compare Customer Data to Demographics Data

First of all: cluster 6.

  • This group comprises the records whose number of missing values exceeded our threshold of 0-1. As such, we didn't really "cluster" them so much as treat them as an effective cluster for the sake of interpretation, since we know little about them beyond the fact that we dropped them prior to running k-means.
  • However, we do know a little about this group from spot-checking how the distributions of the mostly complete features differ between the dropped and retained records:
    1. The primary financial type of people in cluster 6 seems to mostly be "be prepared" as opposed to the dominant type for the rest of the clusters which is "inconspicuous"
    2. The Age Bin distributions seem to indicate that cluster 6 is likely younger on average than the other clusters
    3. The Customer Type distributions are roughly similar between cluster 6 and the other clusters, but cluster 6 leans more heavily toward the "normal returner" type, while the rest of the clusters are dominated by the minimal-returner, incentive-receptive type. Since this customer type is well represented in both groups, it's fair to say cluster 6 is biased toward the "normal returner" type more than anything else.
    4. For regional online affinity, those in cluster 6 have a dominant "middle" group, whereas the rest of the clusters, in aggregate, are pretty uniformly distributed across regional online behavior types.
    5. Nature-loving vacationers appear to be a dominant group for cluster 6, whereas Vacation Habits is roughly uniformly distributed by type across the other clusters in aggregate.
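
The spot-checking described above amounts to comparing normalized value counts of a feature between the dropped (cluster 6) records and the records that were actually clustered. A minimal sketch, with made-up toy DataFrames standing in for the real dropped/retained splits (all names here are assumptions):

```python
import pandas as pd

def compare_feature_distributions(dropped_df, kept_df, feature):
    """Side-by-side normalized value counts of one feature for the
    dropped (high-missing) records vs. the records that were clustered."""
    return pd.DataFrame({
        'dropped (cluster 6)': dropped_df[feature].value_counts(normalize=True),
        'clustered': kept_df[feature].value_counts(normalize=True),
    }).fillna(0.0)

# Toy demonstration with made-up categorical data
dropped = pd.DataFrame({'Customer Type': ['normal', 'normal', 'minimal']})
kept = pd.DataFrame({'Customer Type': ['minimal', 'minimal', 'normal', 'minimal']})
print(compare_feature_distributions(dropped, kept, 'Customer Type'))
```

Using `normalize=True` matters here: the dropped and retained groups have very different sizes, so raw counts would not be comparable.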

Some thoughts based upon assessment of the visualization of the first three principal components:

  1. As a reminder, the first three principal components (PCs) can be described in human-understandable terms as such, based upon my earlier interpretations:

    • PC0: community-wealth- and community-housing-focused
    • PC1: a specific type of older individual: an events-oriented spender whose spending is hard to externally influence, and who likes to be financially prepared for any circumstance
    • PC2: aggressive male type A personalities
  2. Cluster 1 - company's products are popular

    • Negative values of PC0 - not community oriented
    • Even spread over narrow range of PC1
      • PC1 probably not important to this customer class
    • Mostly positive values for PC2, but the cluster appears non-contiguous in the PC0-1-2 space, with a smaller sub-cluster in the negative PC2 range - likely dominant personalities and male types
  3. Cluster 2 - company's products are unpopular
    • Mostly positive in PC0 - community oriented
    • Solidly positive in PC1 - older
    • Mostly positive in PC2 - passive personality type
  4. Cluster 3 - company's products are unpopular
    • Mostly positive values of PC0 - community oriented for the most part
    • Solidly negative values of PC1 - young
    • Nearly even spread over PC2, with positive bias
      • PC2 probably not important to this customer class

Of course, it's not reasonable to assume that these first three PCs are going to be the dominant features of every cluster, just because they explain the most variance for the aggregate dataset. So to make sure we truly capture the most important PCs to each cluster of interest, we'll take a semi-supervised approach to this problem: we'll consider the cluster label to be our target variable in a supervised learning sense, and use techniques from that branch of machine learning to determine the most important features/PCs for each cluster we care about.

One approach that would work here is to use a supervised model (of almost any sort) to predict the cluster labels and then apply a post hoc technique (e.g. LIME) to explain which features contributed to each predicted label. However, this would involve setting up a pipeline, likely with cross-validation, scoring, etc., and would be a bit of an undertaking. Since this is a multi-class classification task, I also can't easily lean on techniques like SelectKBest. As such, I'm going to go with the simpler, albeit less comprehensive, approach of simply inspecting the cluster centroids.
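
The supervised idea just described can be sketched on synthetic data: treat the k-means labels as a classification target and let a tree ensemble rank which components separate the clusters. The synthetic blobs below stand in for the PCA-transformed demographics; none of these variable names come from the notebook, and this is a sketch of the idea rather than the analysis actually performed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the matrix of principal-component scores
X_pca, _ = make_blobs(n_samples=500, n_features=10, centers=4, random_state=42)

# Treat the k-means labels as a supervised target...
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_pca)

# ...and let a random forest rank which components separate the clusters
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_pca, labels)
ranked = np.argsort(forest.feature_importances_)[::-1]
print("Components ranked by importance:", ranked)
```

Unlike inspecting centroids, this ranking reflects which components actually discriminate one cluster from the others, at the cost of fitting and validating an extra model.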

Some thoughts from the cluster centroid analysis:

  1. Cluster 1 - company's products are popular
    • Environmentally-minded high-income earners
    • Aggressive male type A personalities
    • Likely young, from a multi-generational household
    • Likely a renter
    • Not community-focused in their values
  2. Cluster 2 - company's products are unpopular
    • Older, but not retired, independent and hard-to-influence spender
    • Community-focused values
    • Middle-income member of a nuclear family
    • Likely lives in a suburban area (neither urban nor remote)
    • Not environmentally motivated
  3. Cluster 3 - company's products are unpopular
    • Older, likely not retired individuals
    • Community-focused values
    • Middle-income member of a nuclear family
    • Renter
    • Not likely to have any children in the house
    • Likely a bit impulsive in spending habits

The Final Assessment

Based upon my assessment of the cluster centroids for the groups over- and underrepresented in the customer dataset relative to the general population, I expect that this company's products would best be targeted at young, high-income renters who care about the environment. The company would likely benefit from advertising that focuses on environmental and/or personal benefits, rather than messages focused on community/societal gains.

There are also certain segments of the population that are particularly unreceptive to this company's products: specifically, people whose values center on community/societal benefits, and middle-income older individuals.

In [ ]: